[57] ABSTRACT

The present invention provides a document processing
apparatus, a document processing method and a storage
medium storing the method, for the purpose of offering
document filing in which a document can be registered at
little computation cost and at high speed, and retrieval can
be performed with little oversight. In the document processing
apparatus, a similar character classifying element classifies
characters in a document image into similar character
categories in advance and stores the classified categories
together with their representative image features. When the
document image is registered, a pseudo character recognizing
element, without identifying each character in the text
region, classifies the characters into character categories
based on fewer image features than those used in ordinary
character recognition, and stores the resulting category
strings in association with the inputted image.
In retrieval, a retrieval executing element converts each
character in the retrieval keyword into the nearest category,
and retrieves, as the result of retrieval, a document that
includes the converted category string as a part.
[Drawing sheets: FIGS. 3, 5, 10, 14 and 15]
METHOD AND APPARATUS FOR IMAGE
BASED DOCUMENT PROCESSING

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention relates to a document processing 
apparatus which reads and stores a document as an image 
and in particular relates to a document processing apparatus 
having a retrieval function for retrieving a content in a text 
from a document image. 

2. Discussion of the Related Art 

Document filing systems capable of converting a docu- 
ment into an image by an image input device such as an 
image scanner, storing thereof electronically and carrying 
out retrieval later have been put to practical use. However, 
many of such systems have required manual assignment of 
attributes for the retrieval using keywords or the like per 
every inputted image; therefore much labor has been nec- 
essary. 

In the document retrieval, originally, it is desirable to 
carry out full-text retrieval based on the contents of the text. 
It is possible to execute full-text retrieval for an electronic 
document prepared by the desktop publishing (DTP) or the 
like, but it is impossible to carry out the full-text retrieval 
directly on the document image. Therefore, in Japanese 
Patent Application Laid-Open No. 62-44878 (1987), for 
example, it is disclosed that character recognition is per- 
formed on the text portion in a document, and the full-text 
retrieval is made to be possible by coding the text contents. 
Moreover, candidates for each character obtained in the 
process of character recognition are retained so that the 
oversight in retrieval caused by the recognition error is 
reduced. However, character recognition, and in particular
recognition of a document written in Japanese, which has a
large number of character types, obtains feature vectors of
several hundred dimensions and attempts to match them
against the features of approximately 3,000 or more
character types; accordingly, the matching process of the
feature vectors requires much computation cost. Besides,
because the character recognition rate is not very high, a
retrieval keyword may be incorrectly recognized. Japanese
Patent Application Laid-Open No. 62-285189 (1987) dis- 
closes an invention which obtains a character string well- 
formed as Japanese by utilizing a morphological analysis 
after character recognition, and automatically corrects the 
incorrectly recognized characters. In an invention disclosed 
in Japanese Patent Application Laid-Open No. 5-54197 
(1993), Japanese characters are replaced with representative 
characters to reduce the character types to be dealt with, and 
then words are identified by utilizing a rate transition matrix 
for correcting the incorrectly recognized characters. 
However, these inventions basically require much computation
cost in registration of documents for execution of
character recognition, and if the ultimately desired object is
a document image including the word designated in the
retrieval, most of the character recognition would be
performed in vain.

According to "Keyword Search for Japanese Image Text",
Yusa et al., Information Media, 19-1, January 1995, features
of each character image are directly converted into 36-bit
codes instead of executing character recognition on the
features obtained from each character image, features of
a retrieval keyword image are also extracted for feature
matching, and thereby the character string retrieval is
performed using the codes. However, it is necessary to input the
retrieval keyword as an image or to generate an image by
using a character font image corresponding to the keyword;
that is, the method is sensitive to differences between the
fonts used in the document image and in the keyword image.

In "Document Reconstruction: A Thousand Words from
One Picture", Reynar J. et al., in Proc. of 4th Annual
Symposium on Document Analysis and Information
Retrieval, pp. 367-384, Las Vegas, April 1995, an attempt is
disclosed in which characters in a text image in a
language of European origin (English) are classified into a
small number of categories based on their sizes and
positions, and identified as words according to the sequence
of the categories. U.S. Pat. No. 5,325,444 (1994) or
5,438,630 (1995) discloses a technology which measures the
frequency of occurrence of a specific word and identifies a
word without using an OCR by utilizing an image feature
per word unit called "Word Shape". However, it is difficult
to intuitively find a feature to serve as a key for a language
having a large number of character types, such as Japanese or
Chinese. Besides, it is impossible to directly obtain word
units from an image because, unlike languages of European
origin, there is no physical space between the
words on the image. For this reason, it is difficult to directly
identify the words in a text written in Japanese or the like
according to the disclosed methods.

Japanese Patent Application Laid-Open No. 4-199467
(1992) discloses an invention which groups character types
that are apt to be incorrectly recognized as each other and
assigns a character code to each group, which is
used in retrieving. In this method, character codes are first
obtained by executing a character recognition process, and
then converted into those indicating the groups. Therefore,
oversight in retrieval is prevented by the grouping, but much
computation cost and time for character recognition are
still required.

Japanese Patent Application Laid-Open No. 7-152774
(1995) discloses a technique in which, if a character apt to
be incorrectly recognized is included in character strings in
the retrieval condition expression, plural candidates for the
retrieval condition expression are prepared for execution of
retrieval. Furthermore, in an invention disclosed in Japanese
Patent Application Laid-Open No. 6-103319 (1994), if there
are characters that cannot be converted normally, they are left
indefinite and retrieval is executed for such indefinite data.
According to these techniques, oversight in retrieval can be
reduced, but these techniques also require much computation
cost and time for the character recognition.

SUMMARY OF THE INVENTION 

The present invention has been made in view of the above
circumstances and has an object to provide a document
processing apparatus, a document processing method and a
storage medium storing the method, which offer document filing
that executes a registering process at little computation
cost and at high processing speed when a document
is registered, and realizes retrieval with little oversight.

Additional objects and advantages of the invention will be
set forth in part in the description which follows and in part
will be obvious from the description, or may be learned by
practice of the invention. The objects and advantages of the
invention may be realized and attained by means of the
instrumentalities and combinations particularly pointed out in the
appended claims. To achieve the objects and in accordance
with the purpose of the invention, as embodied and broadly
described herein, a document processing apparatus of the
present invention comprises a character category storing
element for storing a category of a similar character made by
classification of characters based on an image feature of
each character with relation to the image feature, a text
region extracting element for segmenting an image of every
character in an inputted document image, a pseudo character
recognizing element for classifying the image of every
character segmented by the text region extracting element into
the category stored in the character category storing element
based on the image feature related to the category, a pseudo
character recognition result storing element for storing the
category into which the image of every character is classified
by the pseudo character recognizing element with
relation to the input document image, a keyword converting
element for converting each character in a retrieval expression
input for retrieval into the nearest category stored in the
character category storing element, and a document retrieving
element for retrieving document images having a category
generated by converting the retrieval expression by
the keyword converting element from the pseudo character
recognition result storing element.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated in
and constitute a part of this specification, illustrate
embodiments of the invention and, together with the description,
serve to explain the objects, advantages and principles of the
invention. In the drawings:

FIG. 1 shows the construction of a first embodiment of a
document processing apparatus according to the present
invention;

FIG. 2 is a flow chart showing an example of a process in
a similar character classifying element in the first embodiment
of the document processing apparatus according to the
present invention;

FIG. 3 illustrates a peripheral feature;

FIG. 4 is a flow chart showing an example of a process of
hierarchical clustering;

FIG. 5 is a flow chart showing an example of an optimization
process of clustering;

FIG. 6 is an example of a similar character category table
in the first embodiment of the document processing apparatus
according to the present invention;

FIG. 7 is an example of a character code-category
correspondence table in the first embodiment of the document
processing apparatus according to the present invention;

FIG. 8 is a flow chart showing an example of a process of
pseudo character recognition in the first embodiment of the
document processing apparatus according to the present
invention;

FIGS. 9(A) and 9(B) illustrate an example of a result of
character region extraction in the first embodiment of the
document processing apparatus according to the present
invention;

FIG. 10 is a flow chart showing an example of a process
of conversion to a representative character code string in the
first embodiment of the document processing apparatus
according to the present invention;

FIGS. 11(A) and 11(B) illustrate an example of a result of
the process of conversion to the representative character
code string in the first embodiment of the document
processing apparatus according to the present invention;

FIG. 12 illustrates an example of a bi-gram table in the first
embodiment of the document processing apparatus according
to the present invention;

FIG. 13 illustrates an example of a representative character
code table in the first embodiment of the document
processing apparatus according to the present invention;

FIG. 14 is a flow chart showing an example of a process
of a retrieval executing element in the first embodiment of
the document processing apparatus according to the present
invention;

FIG. 15 illustrates an example of a character code-category
correspondence table in the case where grouping
into plural categories is permitted in a first variation of the
first embodiment of the document processing apparatus
according to the present invention;

FIGS. 16(A) and 16(B) illustrate a concrete example of
positions of segmentation in the case where plural character
extraction interpretations are possible in a second variation
of the first embodiment of the document processing apparatus
according to the present invention;

FIG. 17 illustrates relation between the segmented character
strings in the case where plural character segmentation
interpretations are possible in the second variation of the
first embodiment of the document processing apparatus
according to the present invention;

FIG. 18 illustrates an example of a representative character
code table in the case where plural segmentation
interpretations are permitted in the second variation of the
first embodiment of the document processing apparatus
according to the present invention;

FIG. 19 is a flow chart showing an example of a process
of preparing the representative character code table in the
case where plural segmentation interpretations are permitted
in the second variation of the first embodiment of the
document processing apparatus according to the present
invention;

FIG. 20 illustrates an example of a bi-gram table in the
case where plural segmentation interpretations are permitted
in the second variation of the first embodiment of the
document processing apparatus according to the present
invention;

FIG. 21 shows the construction of a second embodiment
of the document processing apparatus according to the
present invention;

FIG. 22 illustrates an example of a category word dictionary
in the second embodiment of the document processing
apparatus according to the present invention;

FIG. 23 illustrates another example of the category word
dictionary in the second embodiment of the document
processing apparatus according to the present invention;

FIG. 24 illustrates an example of a code conversion table
in the second embodiment of the document processing
apparatus according to the present invention;

FIGS. 25 and 26 are flow charts showing an example of
operation of a category word detecting element in the second
embodiment of the document processing apparatus according
to the present invention;

FIG. 27 illustrates an example of a dictionary of part of
speech connection in the second embodiment of the document
processing apparatus according to the present invention;

FIG. 28 is a flow chart showing an example of a process
of verifying relation of part of speech connection in the
second embodiment of the document processing apparatus
according to the present invention;

FIG. 29 shows an example of incorrect recognition of the
representative character code;

FIGS. 30(A) and 30(B) illustrate an example of conversion
into representative character code string in the case
where N=1 and N=2 in a first variation of the second
embodiment of the document processing apparatus according
to the present invention;

FIG. 31 shows the construction of a second variation of
the second embodiment of the document processing apparatus
according to the present invention;

FIGS. 32(A)-32(E) illustrate an example of features used
for preparing a detail identification dictionary in the second
variation of the second embodiment of the document
processing apparatus according to the present invention;

FIG. 33 is a flow chart showing an example of procedures
in preparing the detail identification dictionary in the second
variation of the second embodiment of the document
processing apparatus according to the present invention;

FIG. 34 illustrates an example of the detail identification
dictionary in the second variation of the second embodiment
of the document processing apparatus according to the
present invention;

FIG. 35 is a flow chart showing an example of a process
of a detail identifying element in the second variation of the
second embodiment of the document processing apparatus
according to the present invention;

FIG. 36 illustrates an example of relation between segmented
character strings in a third variation of the second
embodiment of the document processing apparatus according
to the present invention;

FIG. 37 illustrates another example of relation between
the segmented character strings in the third variation of the
second embodiment of the document processing apparatus
according to the present invention; and

FIGS. 38 and 39 are flow charts showing an example of
a process of integrating the segmented character strings in
the third variation of the second embodiment of the document
processing apparatus according to the present invention.

DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENTS

Preferred embodiments of a document processing apparatus
according to the present invention are now described in
detail based on the drawings.

First Embodiment

FIG. 1 shows the construction of a first embodiment of a
document processing apparatus according to the present
invention. In the figure, a keyboard 3 and mouse 4 for
directing operation, a display 2 for showing a result, an
image scanner 5 for inputting a document, a printer 6 for
printing and outputting the result, an external storage device
7 for storing programs or data to be processed, and so on are
connected to a processor 1. The processor 1 actually executes
processes according to software stored in the external
storage device 7. The processor 1 may be, for example, an
ordinary computer. As the external storage device 7, a hard
disk capable of quick access is adopted, for example. Alternatively,
the external storage device 7 may be a mass storage device such
as an optical disk for retaining a large amount of document
images.

The processor 1 executes the software, which consists of
a similar character classifying element 11, a pseudo character
recognizing element 12 and a retrieval executing element
13. The similar character classifying element 11 classifies
object characters into categories each of which
consists of similar characters based on features of the image.
Here, a similar character category table which is necessary
for registration of a document and a character code-category
correspondence table necessary for retrieval are generated.
These two tables are sufficient for actual document registration
and retrieval, and therefore, the processes here are
executed only in advance of inputting the document image.
The similar character category table stores the sets of
character codes representing a category, a character code of
one or more characters actually belonging to the category and
an image feature vector representing the category. The
character code-category correspondence table is a reverse
table of the similar character category table and is used for
converting the keyword into the representative character
code string.

The pseudo character recognizing element 12 extracts text
regions from the inputted document image, classifies characters
contained in the text regions into the similar
character categories, assigns representative character codes
to the categories, and stores them with the positions of the
corresponding characters in the image in the external storage
device 7.

The retrieval executing element 13 requests the user to
type in the retrieval expression, and if the retrieval expression
is inputted, the element 13 converts the keyword
included in the retrieval expression into the representative
character code string of the category by the character code-category
correspondence table, fetches the document images
including the code string of the converted keyword, and
shows them with the position of the retrieved keyword to the
user.

Details of processes of each element are now explained.
FIG. 2 is a flow chart showing an example of the process of
the similar character classifying element. The similar character
classifying element 11 constructs the similar character
category table and the character code-category correspondence
table [illegible]
included in each of the similar character categories as the
[illegible]
images and character codes corresponding thereto. The
training samples of various fonts, having different threshold
values for binarization, and so forth are prepared for all
character types.

First, in step 21, normalization of the size of each character
image is executed as a preprocess. Here, it is assumed
that the normalized size is 64x64 (pixels). Next, the feature
extraction is carried out. The peripheral feature is used here,
which is illustrated in FIG. 3. As shown in the figure,
scanning is started from each side of a circumscribing
rectangle of the character to take a distance from the starting
point to the point at which a white pixel changes to a black
pixel as the feature, wherein the first changing position and
the second changing position are extracted. Here, it is
assumed that the character image is divided into 8 regions in
each of the horizontal and vertical directions to
be scanned, and the feature vectors of 8x4x2, a total of 64
dimensions, are extracted. FIG. 3 shows the case where the
scanning is started from the left side of the circumscribing
rectangle, and the scanning locus from the starting point to
the point at which the white pixel first changes to the black
pixel is indicated by a broken arrow. In ordinary character
recognition, other features are used together to improve the
precision of recognition. However, sufficient precision is
expected with the feature vectors of the small number of
dimensions because it is enough to classify the characters
into the small number of similar character categories. It may
be possible to generate the feature vectors by extracting
other features instead of, or together with, the peripheral
feature.

If the feature vectors are obtained for each character of the
training sample, an average of the feature vectors which are
obtained from the different fonts or different threshold values
for binarization for the same character type (for example,
[illegible]) is calculated, and thereby a representative vector
is generated for each character type. If distances among the
representative vectors of plural characters are small in a
feature space, these plural characters are similar characters.
In step 23, a clustering process, namely, grouping the characters
whose representative vectors are close together, is carried out. For
clustering, a method such as that described in "Pattern
Classification and Scene Analysis", Duda and Hart,
Wiley-Interscience, can be used. According to this method,
hierarchical clustering is first carried out and the result is
taken as the initial clusters, and then optimization is
performed so that the sum of squares of differences between the
center of gravity of each of the clusters and the feature
vectors corresponding thereto becomes minimum.
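
The 64-dimensional peripheral feature described above, and the averaging of sample vectors into a representative vector, can be sketched as follows. This is only an illustrative reading of the text, not the implementation of the invention: the function names are hypothetical, averaging the distances over the scan lines within each of the 8 regions is an assumption (the text does not say how the lines of one region are combined), and a missing white-to-black change is assumed to be reported as the full line length.

```python
def transitions(line, count=2):
    """Positions of the first `count` white->black changes along one scan line.

    `line` is a sequence of pixels (0 = white, 1 = black), ordered from the
    side where scanning starts. Assumption: a change that never occurs is
    reported as len(line).
    """
    found = []
    prev = 0  # the scan is assumed to start on white
    for pos, pix in enumerate(line):
        if prev == 0 and pix == 1:
            found.append(pos)
            if len(found) == count:
                break
        prev = pix
    while len(found) < count:
        found.append(len(line))
    return found


def peripheral_feature(img, strips=8):
    """Return 4 sides x `strips` regions x 2 changes values (64 for strips=8).

    `img` is a square, size-normalized binary image (list of rows).
    """
    n = len(img)
    rows = [list(r) for r in img]
    cols = [[img[y][x] for y in range(n)] for x in range(n)]
    # Scan lines seen from the left, right, top and bottom sides.
    sides = [rows, [r[::-1] for r in rows], cols, [c[::-1] for c in cols]]
    width = n // strips
    feature = []
    for lines in sides:
        for s in range(strips):
            band = lines[s * width:(s + 1) * width]
            firsts, seconds = zip(*(transitions(l) for l in band))
            feature.append(sum(firsts) / len(band))   # first change (assumed: mean)
            feature.append(sum(seconds) / len(band))  # second change (assumed: mean)
    return feature


def representative_vector(vectors):
    """Average the feature vectors of one character type (fonts, thresholds)."""
    return [sum(c) / len(vectors) for c in zip(*vectors)]
```

For a 64x64 image the result has 8x4x2 = 64 components, matching the dimension count given in the text.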

FIG. 4 is a flow chart showing an example of the process of
the hierarchical clustering. In step 31, it is assumed that the
desired number of clusters is m, the total number of character
types is n, and the initial set of clusters is
X = {c_i | i = 1, ..., n}, where each c_i retains the
representative feature vectors of similar character types.
As the initial value of each c_i, the representative
feature vector of each of the character types is inputted one
by one. In step 32, the current number of clusters and the
desired number of clusters m are compared with each other.
If the current number of clusters is equal to m, X at that time
is determined to be the result of the clustering and the
process is finished. Otherwise, the process proceeds to step
33. In step 33, the pair of clusters having the shortest distance
d between them in the feature space is found and they are
integrated into one cluster, and then the process returns to
step 32.

The desired number of clusters m can be given arbitrarily,
but it is assumed to be 500 here. The JIS level-1 kanji set has
approximately 3,000 character types, and accordingly, one
cluster has 6 character types on average. In this process,
various methods for calculating the distance d between
the clusters may be considered. Here, the following method is
adopted: one feature vector is taken from each of
two clusters to make a pair of vectors, and the
shortest distance among all the pairs of vectors
generated in this way is taken as the distance between the two clusters.
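
The loop of steps 31 to 33 with the shortest-pair distance can be sketched as below. It is a minimal illustration under stated assumptions: the function names are hypothetical, Euclidean distance is assumed for the feature space, and the brute-force search is far too slow for about 3,000 character types without further optimization.

```python
import math


def cluster_distance(a, b):
    """Shortest distance between any pair of vectors taken from clusters a and b."""
    return min(math.dist(u, v) for u in a for v in b)


def hierarchical_clustering(vectors, m):
    """Merge the closest pair of clusters until only m clusters remain."""
    # Step 31: every representative vector starts as its own cluster.
    clusters = [[v] for v in vectors]
    # Step 32: stop once the desired number of clusters is reached.
    while len(clusters) > m:
        best = None
        # Step 33: find the pair of clusters with the shortest distance d.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # integrate the pair into one cluster
    return clusters
```

With n = 3,000 and m = 500 as in the text, the result would hold 6 character types per cluster on average.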

Since the result of the hierarchical clustering cannot be
considered to be the optimum one, optimization of the
clusters is executed in step 24 of FIG. 2, with the result
of the hierarchical clustering as a starting point. For
optimization, the sum of squares of distances between the
average value of feature vectors in each cluster and each
feature vector is calculated, and the sum total of squares over
all clusters is regarded as a decision function. The smaller
the value of the decision function is, the better the clustering
is, because it means each cluster is more tightly packed with its
feature vectors. In general, it is difficult to find the clustering
which minimizes the value of the decision function, but
pseudo optimization is possible.

FIG. 5 is a flow chart showing an example of the process
of optimization of the clustering. First, in step 41, an
arbitrary feature vector x is extracted. In step 42, the cluster to
which the feature vector x currently belongs is assumed to
be c_i, and whether the feature vector registered therein is only
x or not is determined. If the registered feature vector is only
x, the process returns to step 41. Otherwise, one of the
following calculations is executed on all clusters c_j:

a_j = n_j / (n_j + 1) * ||x - m_j||^2    when j ≠ i

a_j = n_j / (n_j - 1) * ||x - m_j||^2    when j = i

wherein n_j is the number of feature vectors registered in c_j,
and m_j is the average of the feature vectors belonging to c_j.
The above expressions represent the amount of change of
the decision function in the case where the feature vector x
is transferred to c_j.

In step 44, it is determined whether the j that minimizes the
calculated value a is other than i, and if so, the feature
vector x is transferred to the cluster c_j in step 45.

In step 46, it is determined whether no further transfer of
any feature vector between clusters is possible. If a
transfer is still possible, the process returns to step
41 to repeat the processes in step 42 and the subsequent steps,
assuming the next feature vector to be x. If it is determined
that the transfer of all feature vectors between clusters has
been completed, the clusters at that time are regarded as the
result, and the process is completed.
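
The procedure of FIG. 5 can be illustrated with the sketch below. It assumes the transfer criterion of the cited Duda and Hart method, in which the value computed for each cluster is the squared distance to the cluster mean weighted by n_j/(n_j+1) for a different cluster and n_j/(n_j-1) for the current one; this weighting, the function names and the visiting order of the vectors are assumptions of the sketch, not details confirmed by the text.

```python
def mean(vectors):
    """Component-wise average of a non-empty list of vectors."""
    return [sum(c) / len(vectors) for c in zip(*vectors)]


def sqdist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def decision_function(clusters):
    """Sum over all clusters of squared distances to the cluster mean."""
    return sum(sqdist(x, mean(c)) for c in clusters for x in c)


def optimize(clusters):
    """Move vectors between clusters while the decision function decreases."""
    clusters = [list(c) for c in clusters]
    moved = True
    while moved:                           # step 46: repeat until no transfer occurs
        moved = False
        for i in range(len(clusters)):
            for x in list(clusters[i]):    # step 41: take the next vector x
                if len(clusters[i]) == 1:  # step 42: x is alone in its cluster
                    break
                costs = []
                for j, c in enumerate(clusters):
                    n_j = len(c)
                    weight = n_j / (n_j - 1) if j == i else n_j / (n_j + 1)
                    costs.append(weight * sqdist(x, mean(c)))
                best = min(range(len(costs)), key=costs.__getitem__)
                # steps 44-45: transfer x only if another cluster is strictly better
                if best != i and costs[best] < costs[i]:
                    clusters[i].remove(x)
                    clusters[best].append(x)
                    moved = True
    return clusters
```

Because a vector is transferred only when the value strictly decreases, each transfer lowers the decision function and the loop terminates.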

As described above, clustering of similar characters is
performed. In the processes shown in FIG. 5, various
methods can be adopted for extracting an arbitrary character
in step 41; similar processes can be executed for each of them, and the
clustering that minimizes the evaluation function (the sum total of
squares of distances between the average value of feature
vectors in every cluster and each of the feature vectors)
can be adopted.
Returning to FIG. 2, in step 25, the similar character
category table is generated based on each cluster and then
stored; it is used for registration of a document. FIG. 6
illustrates an example of the similar character category table,
in which each category consists of a representative character
code of the category, codes of similar characters belonging
to the category and a representative vector of the category
feature. The category feature vector is the average of feature
vectors of characters belonging to the category. As the
representative character code of the category, an arbitrary one
selected from the character codes of
similar characters belonging to the category is assigned. In FIG. 6,
characters themselves are entered instead of the character
codes.

Further, in step 26, the character code-category
correspondence table is simultaneously generated as the reverse
table of the similar character category table, for converting
the retrieval keyword into the representative character code
string in the retrieving process. FIG. 7 illustrates an example
of the character code-category correspondence table. As
shown in the figure, the character code-category
correspondence table is generated by making sets of a character
code and the representative character code of the category
corresponding to the character code.
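
The two tables of steps 25 and 26, and the keyword conversion they support, can be sketched as follows. The data layout (dictionaries keyed by character), the function names and the choice of the smallest character as the representative code are assumptions made for illustration; the text only requires that an arbitrary member be chosen.

```python
def build_tables(clusters):
    """Build the similar character category table and its reverse table.

    `clusters` is a list of dicts, each mapping a character to its
    representative feature vector (hypothetical layout).
    """
    category_table = []    # step 25: one entry per category
    char_to_category = {}  # step 26: character code -> representative code
    for members in clusters:
        chars = sorted(members)
        rep = chars[0]  # any member may serve as the representative code
        vectors = list(members.values())
        centroid = [sum(c) / len(vectors) for c in zip(*vectors)]
        category_table.append({"code": rep, "members": chars, "vector": centroid})
        for ch in chars:
            char_to_category[ch] = rep
    return category_table, char_to_category


def convert_keyword(keyword, char_to_category):
    """Convert a retrieval keyword into its representative character code string."""
    return "".join(char_to_category[ch] for ch in keyword)
```

In this sketch a keyword character that is absent from the table raises a KeyError; how such characters are handled is not covered here.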

Next, the process of document registration carried out in
the pseudo character recognizing element 12 is described.
FIG. 8 is a flow chart showing an example of the process of
the pseudo character recognizing element. At first, the user
inputs a document desired to be registered as an image by
utilizing a connected image scanner 5 or the like. In some
cases, the document is transmitted through a facsimile or
network and inputted. Here, it is assumed that a monochrome
binary image is inputted, but it is possible to input
the document as a gray-scale or color image and, at the
time of inputting to the pseudo character recognition
process, convert it into a binary image by threshold
value processing. As a preprocess for the inputted binary
image, noise removal, skew correction and so forth are
executed.
In step 51, the character regions included in the binary
image are extracted. For this process, the region division
method by peripheral distribution disclosed in "Region
division of document image using characteristics of peripheral
distribution, linear density and circumscribing rectangle
in combination", Akiyama and Masuda, The Transactions of
the Institute of Electronics, Information and Communication
Engineers D-II, vol. J69, No. 8, for example, can be used. As
a matter of course, various methods have been proposed for
the region division process, and the method is not limited to
those based on peripheral distribution as described here. A
region determined to be a graphical image is excluded
from the object of processing. The divided character block
regions are assigned numbers called block IDs in order
as rectangular regions, and stored in a memory.

FIGS. 9(A) and 9(B) illustrate an example of the result of
character region extraction. FIG. 9(A) shows an example of
an inputted document image: hatched portions represent the
lines in which characters are in a row; and an x-ed portion
represents the graphical image region. If such a document
image is inputted, for example, it is divided into the character
block regions marked with bold frames and the graphical
image region as shown in FIG. 9(B), and the block IDs
are assigned to the character block regions. In FIG. 9(B), the
block IDs 1-6 are assigned.

Returning to FIG. 8, in step 52, the character region is
divided into lines, and further divided into characters.
Regarding the character segmentation process, various 
methods have also been proposed, and any of them will 
suffice. 

In step 53, each segmented character image is converted
into the representative character code of the similar
character category. FIG. 10 is a flow chart of an example of
the process of conversion into the representative character
code string. At first, punctuation marks which obviously
cannot be the retrieval keywords are extracted. In step 61,
it is determined whether a character image is a punctuation
mark or not. In the determination process, a punctuation
mark satisfies the conditions that the width and height of
the circumscribing rectangle of the character image are not
more than the threshold values Tw and Th, respectively, the
upper end of the circumscribing rectangle is lower than the
center of the character line, and the distance to the
right-adjacent character is larger than the threshold value
Tr. Since the width
and height of the Japanese character are approximately 
equal, the threshold values Tw, Th and Tr may be set to, for 
example, Tw=Th=Tr=h/2, provided that the height of the 
character line is assumed to be h. To the character deter- 
mined to be the punctuation mark, the character category 
" O" indicating the punctuation mark is assigned in step 62. 

If the character image is not determined to be the
punctuation mark, its size is normalized in step 63 in
completely the same way as in the similar character
classification process, and the image feature is calculated.
The peripheral feature has been extracted in the similar
character classification process; therefore, the peripheral
feature is calculated here accordingly. Next, in step 64, it
is determined to which of the similar character categories
the feature vector of the unknown character belongs. That
is, the Euclidean distances between the feature vector of
the unknown character and the representative vectors of the
similar character categories are calculated for comparison.
The representative vectors can be used because they have
been registered at the similar character category table. In
step 65, the similar character category whose calculated
Euclidean distance is the minimum is adopted as the
character category, and its representative character code is
outputted as the result. The identification method utilizing
the minimum distance is used here for simplification, but
there are various other identification methods, and any of
them may be used.
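Steps 61 to 65 can be pictured with the following minimal
sketch. It is only an illustration under stated assumptions:
the names CharBox, PUNCTUATION, is_punctuation,
nearest_category and to_code are hypothetical, the feature
vector is taken as already computed (size normalization and
the peripheral feature are not shown), and the image
y-coordinate is assumed to grow downward.

```python
import math
from dataclasses import dataclass

@dataclass
class CharBox:
    x: float       # left edge of the circumscribing rectangle
    y: float       # top edge (y is assumed to grow downward)
    width: float
    height: float

PUNCTUATION = "PUNCT"  # hypothetical label for the punctuation category

def is_punctuation(box, line_top, line_height, gap_to_right):
    """Step 61: small rectangle, below the line center, followed by a wide gap."""
    tw = th = tr = line_height / 2            # Tw = Th = Tr = h/2 as in the text
    line_center = line_top + line_height / 2
    return (box.width <= tw and box.height <= th
            and box.y > line_center           # upper end is lower than the center
            and gap_to_right > tr)

def nearest_category(feature, categories):
    """Steps 64-65: representative code with the minimum Euclidean distance."""
    return min(categories, key=lambda code: math.dist(feature, categories[code]))

def to_code(box, feature, categories, line_top, line_height, gap_to_right):
    """Step 62 for punctuation marks, steps 63-65 otherwise."""
    if is_punctuation(box, line_top, line_height, gap_to_right):
        return PUNCTUATION
    return nearest_category(feature, categories)

# Toy data: two categories with 2-dimensional representative vectors.
cats = {"A": [0.0, 0.0], "B": [10.0, 10.0]}
dot = CharBox(x=100, y=30, width=8, height=8)
big = CharBox(x=0, y=2, width=36, height=36)
```

Only one distance per category is computed, which is why the
cost grows with the number of categories times the number of
feature dimensions.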
FIGS. 11(A) and 11(B) illustrate an example of the process
of conversion into the representative character code string.
If it is assumed that a string of character images is
inputted as shown in FIG. 11(A), the first character image
is segmented at the beginning and its feature vector is
obtained. Next, the distances from the representative vector
of each of the categories stored in the similar character
category table are calculated, and the representative
character code of the category having the minimum distance
is assigned thereto. For example, if the similar character
category table as shown in FIG. 6 is registered, the result
of conversion of all character images into the
representative character codes in order is the string of the
representative character codes of the corresponding
categories.

An ordinary character recognition is not carried out here, 
and merely the matching with the small number of character 
categories is executed by utilizing the feature vectors of a 
small number of dimensions. Though the similar character 
codes are registered at the similar character category table, 
the similar character codes are not used at this time because 
the character recognition is not carried out here. 

In this way, in the process of conversion into the repre- 
sentative character code string, the processing speed can be 
greatly improved because only the matching process with 
the small number of character categories is necessary. The 
matching uses the Euclid's distance, and the amount of 
calculation is approximately in proportion to the number of 
dimensions of the feature vectors and the number of iden- 
tification categories. Now if it is assumed that the number of 
character types which are the objects of identification is 
3,000, the number of similar character categories is 500, the 
number of dimensions of the feature vectors in the case of 
the ordinary character recognition is 300 and the number of 
dimensions of the feature vectors in the case of this method 
is 64, the amount of calculation for matching is in total
not more than 1/28 of that in the ordinary character
recognition (500 x 64 = 32,000 against 3,000 x 300 =
900,000 multiplications). As the method of accelerating the
speed of character recognition for Japanese characters, a 
hierarchical identification method such that some tens to 
some hundreds of similar character types are extracted by 
utilizing the feature vectors of the small number of dimen- 
sions (rough classification), and more detailed identification 
is executed by utilizing the feature vectors with more 
dimensions (detailed classification) is known. If it is 
assumed that even the vectors of the number of dimensions 
same as those of the method used in this embodiment is 
utilized for the process of the rough classification in such a 
method described above, matching with all character types 
(3,000) is necessary, and besides, the detailed classification 
is further required; accordingly, the amount of calculation in 
total is not more than Ms of the amount of calculation in the 
ordinary character recognition.

Returning to FIG. 8, it is inefficient to directly search
for the representative character code string obtained in
step 53 in the process of retrieving; therefore, indexes for
retrieval are prepared and their contents are updated
whenever any document is registered. Here, the index
according to bi-gram is adopted, and registration at the
bi-gram table is executed in step 54. The bi-gram refers to
a partial character string consisting of two successive
characters in a character string. That is, each pair of
adjacent characters in the character string forms one
bi-gram. The bi-grams of all representative character code
strings are extracted and made to be the indexes of a table,
and then the document image ID and the position of the
bi-grams in the representative character code string are
stored.

FIG. 12 illustrates an example of the bi-gram table. In the 
figure, the bi-gram table of the representative character code 
string " X&W &$ff" obtained corresponding to the character 
string " j&tRtt" used in the above example is shown. 
The bi-gram table shown in FIG. 12 consists of two stages.
The first stage regards the bi-gram as a key and stores a
pointer to a table showing the content of the bi-gram. The
table indicated by the pointer consists of sets of the
document ID, the block ID indicating which region the
bi-gram belongs to, and the character position; whenever the
corresponding bi-gram is found in a character block of an
inputted document, its entry is added. The bi-gram table can
be implemented by a publicly known technique, such as a
B-tree or a hash table using the bi-gram as a key, and
thereby high-speed retrieval is available. For a character
image determined to be the punctuation mark, no bi-gram is
generated.
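The two-stage bi-gram table can be sketched as follows. This
is a minimal illustration, not the implementation of the
embodiment: a Python dictionary stands in for the hash
table, the names add_block and PUNCT are hypothetical, and
single letters stand in for representative character codes.

```python
from collections import defaultdict

PUNCT = None  # hypothetical marker for a character judged to be punctuation

def add_block(table, doc_id, block_id, codes):
    """Register every pair of successive codes of one character block."""
    for pos in range(len(codes) - 1):
        first, second = codes[pos], codes[pos + 1]
        if first is PUNCT or second is PUNCT:
            continue  # no bi-gram is generated for punctuation marks
        # First stage: the bi-gram is the key.
        # Second stage: list of (document ID, block ID, character position).
        table[(first, second)].append((doc_id, block_id, pos))

table = defaultdict(list)
add_block(table, "00001", 1, ["a", "b", "c", PUNCT, "a", "b"])
```

Each newly registered document only appends entries, so the
index can be updated incrementally at registration time.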

Returning to FIG. 8, in step 55, the representative
character code string obtained in step 53 is registered at
the representative character code table for every character
block with its position in the image, and stored in the
external storage device 7 or the like with the inputted
image. FIG. 13 illustrates an example of the representative
character code table, which stores sets of each
representative character code and the position of the
rectangle which the character occupies in the image. In FIG.
13, the characters themselves are entered in the table
instead of the representative character codes. The position
of the rectangle which the character occupies in the image
is represented by (top-left x-coordinate, top-left
y-coordinate, width, height). By execution of the above
procedures, the registration process for the inputted
document image is completed.

Finally, the retrieving process in the retrieval executing
element 13 is described. FIG. 14 is a flow chart showing an
example of the process in the retrieval executing element.
The retrieval executing element 13 waits until the user
inputs a retrieval expression. As the user inputs the
retrieval expression by the keyboard 3, for example, while
watching the display 2, the retrieval executing element 13
reads the inputted retrieval expression in step 71. Though
various forms are possible for the retrieval expression, it
is assumed here that the retrieval expression is formed with
retrieval keywords combined by Boolean operators such as
OR, AND or logical NOT.

After the retrieval expression is read, the keywords in the
retrieval expression are extracted by parsing the retrieval
expression in step 72, and the keywords are converted into
the representative character code strings of the categories
with reference to the character code-category correspondence
table. As a specific example, the case where the retrieval
expression consists of two keywords, one of four characters
and one of two characters, combined by AND is considered.
This retrieval expression means the direction to retrieve
the document images including both of the two keywords.
These two keywords are converted into the corresponding
representative character code strings, respectively, with
reference to the character code-category correspondence
table.

Next, it is examined whether or not there is any
representative character code string which includes the
representative character code strings converted from the two
keywords among those obtained from the registered document
images, and if any, its position in the image is stored.
Actually, in step 74, the bi-grams of the representative
character code string corresponding to the keyword are
generated, and in step 75, the generated bi-grams are
retrieved from the above-described bi-gram table, and
thereby the IDs of the document images corresponding to the
bi-grams and the positions of appearance of the bi-grams are
obtained. If the retrieval keyword has at least three
characters, plural bi-grams are generated, and it is
necessary that these bi-grams successively appear in the
same character block of the same document. Accordingly, the
positions of appearance of the bi-grams are traced in order
as to the same block ID of the same document image ID, and
the bi-grams which do not appear successively are deleted.

In the example of the retrieval expression described above,
three bi-grams are generated from the four-character
keyword, and the two-character keyword is directly regarded
as a bi-gram. For example, it is assumed that the bi-gram
table as shown in FIG. 12 is generated. The first bi-gram is
included in three documents whose document IDs are 00001,
00015, and 00023. In the document having the document ID
00001, the position of the first bi-gram in each of the
block IDs 1 and 2 is directly followed by the second
bi-gram. However, in each of the documents having the
document IDs 00015 and 00023, the second bi-gram is not
subsequent to the first bi-gram. Consequently, it turns out
that the document having the document ID 00001 includes the
character string of the first three characters. The same
process is executed as to the third bi-gram, and finally the
document ID of the document including the whole
four-character string can be obtained. Since the other
keyword is a word consisting of two characters, it is
sufficient to only examine the corresponding entry of the
bi-gram table. In this way, the document IDs of the
documents in which the retrieval keywords appear and their
positions of appearance are obtained.

At last, the logical operation in the retrieval expression
is executed in step 76. That is, the logical operation is
executed on the sets of the document IDs of the documents
including each of the retrieval keywords, and finally the
set of the document IDs which matches the retrieval
expression is obtained. For example, if the set of the
document IDs for the first keyword contains 00001 and 00202
together with one ID not in the other set, and the set for
the second keyword is (00001, 00054, 00202), the result of
AND on these sets is (00001, 00202). That is, each of the
document images having the document IDs 00001 and 00202
includes both of the representative character code strings.
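Steps 74 to 76 can be sketched as below. The sketch rests on
assumptions: the bi-gram table is a dictionary mapping a
pair of codes to (document ID, block ID, position) entries,
the function names find_keyword and docs_with are
hypothetical, the sample entries are invented for
illustration, and keywords of fewer than two characters are
not handled.

```python
def find_keyword(table, codes):
    """Return the set of (doc_id, block_id, start_pos) where the codes occur."""
    bigrams = [tuple(codes[i:i + 2]) for i in range(len(codes) - 1)]
    # Candidate start positions come from the first bi-gram (step 75).
    hits = set(table.get(bigrams[0], []))
    for offset, bigram in enumerate(bigrams[1:], start=1):
        positions = set(table.get(bigram, []))
        # Keep a candidate only if the next bi-gram follows it successively
        # in the same block of the same document.
        hits = {(d, b, p) for (d, b, p) in hits if (d, b, p + offset) in positions}
    return hits

def docs_with(table, codes):
    """Document IDs in which the representative code string appears."""
    return {doc for (doc, _block, _pos) in find_keyword(table, codes)}

# Invented sample index: letters stand in for representative character codes.
table = {
    ("a", "b"): [("00001", 1, 0), ("00015", 2, 3), ("00202", 1, 5)],
    ("b", "c"): [("00001", 1, 1), ("00015", 1, 0), ("00202", 1, 6)],
    ("x", "y"): [("00001", 3, 2), ("00054", 1, 0), ("00202", 2, 9)],
}

# Step 76: AND of the two keywords is the intersection of the ID sets.
result = docs_with(table, ["a", "b", "c"]) & docs_with(table, ["x", "y"])
```

In this sample, document 00015 contains both bi-grams of the
first keyword but in different blocks, so it is dropped
before the logical operation.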

In step 77, the document images corresponding to the 
document IDs obtained as described above are taken out of 
the external storage device 7, for example, and displayed in 
order on the display 2 in step 78. Because the positions of
the characters on the image can be obtained from the
representative character code table stored with the image,
in accordance with the obtained block IDs and character
positions, the corresponding characters are highlighted.
The highlight display may be the black-and-white inversion 
display or the display with a distinguishable color in the case 
of the color display. If the user gives the direction as to 
printing after confirming the result, the document images are 
outputted to the printer 6. 

Next, a first variation of the first embodiment of the
document processing apparatus according to the present
invention is described. The first variation is an
improvement of the retrieving precision. As described in "A
method for Composing the Extended Dictionary in which the
Same Character is Involved in the Different Clusters for a
Hierarchical Chinese Characters Recognition System", Ito et
al., The Transactions of the Institute of Electronics,
Information and Communication Engineers D-II, vol. J78, No.
6, pp. 896-905, if the clustering is performed by utilizing
the representative vector which is an average of feature
vectors of the same character type, actual character images
cannot be classified properly into the categories
corresponding thereto in some cases. To avoid this
inconvenience, the e-component extension method disclosed in
the above article can be used. That is, after clustering is
performed by utilizing the representative vector of each
character type, the Euclidean distance between the feature
vector of each of the character images of the testing sample
and the representative vector of each category is
calculated, and then the character code is assigned to all
categories which exist within the distance obtained by
adding the scalar parameter e to the minimum Euclidean
distance, and is registered there as a similar character. If the
value of e becomes larger, the precision of the pseudo 
character recognition is further improved. However, since 
the number of character codes included in one category is 
increased, there are more possibilities of outputting an 
incorrect result in retrieval. To determine an optimum value 
of e, a set of unknown character images which is different 
from the testing sample is prepared at first. Then the pseudo 
character recognition process is performed against various 
values of e by utilizing the extended similar character 
category, and as a result, e is set to the minimum value such 
that the character code is correctly included in the categories 
identified with all characters in the set of unknown character 
images. 
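The extension by the parameter e can be sketched as follows.
This is a simplified illustration under assumptions: the
function names extended_categories and
build_similar_characters are hypothetical, feature vectors
are plain lists of numbers, and the toy vectors are invented.

```python
import math

def extended_categories(feature, representatives, e):
    """Categories whose distance is within (minimum distance + e)."""
    dists = {cat: math.dist(feature, vec) for cat, vec in representatives.items()}
    limit = min(dists.values()) + e
    return {cat for cat, d in dists.items() if d <= limit}

def build_similar_characters(samples, representatives, e):
    """samples: iterable of (character_code, feature_vector) test samples."""
    members = {cat: set() for cat in representatives}
    for code, feature in samples:
        for cat in extended_categories(feature, representatives, e):
            members[cat].add(code)  # the code may land in several categories
    return members

reps = {"C1": [0.0, 0.0], "C2": [4.0, 0.0]}
samples = [("x", [1.0, 0.0]), ("y", [1.9, 0.0])]
strict = build_similar_characters(samples, reps, e=0.0)
relaxed = build_similar_characters(samples, reps, e=0.5)
```

With e=0 every sample falls only into its nearest category;
with a larger e the sample lying near the boundary is also
registered at the neighboring category, which is the
trade-off between fewer misses and larger categories
described above.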

In this case, in the character code-category correspon- 
dence table for retrieval, plural similar character categories 
correspond to one character code. FIG. 15 illustrates an 
example of the character code-category correspondence 
table in the case where the classification into plural
categories is permitted. In the example of FIG. 15, for
instance, one character is classified into two categories
having different representative characters. In some cases,
one character is classified into three or more categories,
though not shown in FIG. 15.

Because one character is classified into plural categories 
as described above, plural representative character code 
strings are possible for one keyword when the keyword in 
the retrieval expression is converted into the representative 
character code string. For example, if the contents of the
character code-category correspondence table are as shown
in FIG. 15, each of two characters of the keyword belongs to
two categories. In this case, the keyword used in the above
example of retrieval expression is converted into four
representative character code strings. All documents
including at least one of those four representative
character code strings are extracted, and the retrieval is
internally processed as the OR of these four keywords. By
execution of these processes, retrieval can be carried out
without oversight, though the process time is increased a
little.
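The development of one keyword into plural representative
character code strings amounts to a Cartesian product over
the categories of each character. A minimal sketch, assuming
a hypothetical function expand_keyword and an invented
correspondence table in which letters stand in for
characters and categories:

```python
from itertools import product

def expand_keyword(keyword, char_to_categories):
    """All representative code strings obtainable from the keyword."""
    options = [char_to_categories[ch] for ch in keyword]
    return [list(combo) for combo in product(*options)]

# Invented table: the first and third characters belong to two categories each.
char_to_categories = {
    "P": ["P1", "P2"],
    "Q": ["Q1"],
    "R": ["R1", "R2"],
    "S": ["S1"],
}
strings = expand_keyword("PQRS", char_to_categories)
```

Each resulting string is then retrieved separately and the
document ID sets are combined by OR (set union).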
Moreover, in the case where plural categories correspond to
one character, the certainties of the four internally
developed keywords can be shown by also storing the
certainty of the categories. For example, it is assumed that
the probabilities of identifying the first of the two
characters with its two categories are 0.7 and 0.3,
respectively, and the probabilities of identifying the
second character with its two categories are 0.8 and 0.2,
respectively. In this case, the four developed keywords
appear with probabilities of 0.7x0.8=0.56, 0.3x0.8=0.24,
0.7x0.2=0.14 and 0.3x0.2=0.06, respectively. In this way, by
rearranging the developed keywords in the order of
certainty, it becomes possible to offer the retrieved
document images to the user in the order of certainty. The
probability of identification of each character with the
corresponding category can be calculated by counting at what
rate the characters of the same character type in the
unknown character image set used for extending the category
are included in the corresponding category.
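The ordering by certainty can be sketched as follows, using
the probabilities of the example above. The function name
ranked_expansions and the letter codes are hypothetical, and
the certainty of a developed keyword is taken, as in the
text, to be the product of the per-character probabilities.

```python
from itertools import product
from math import prod

def ranked_expansions(keyword, char_to_categories):
    """Developed keywords sorted from the most to the least certain."""
    options = [char_to_categories[ch].items() for ch in keyword]
    scored = []
    for combo in product(*options):
        codes = [cat for cat, _p in combo]
        certainty = prod(p for _cat, p in combo)
        scored.append((certainty, codes))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored

char_to_categories = {
    "P": {"P1": 0.7, "P2": 0.3},
    "Q": {"Q1": 0.8, "Q2": 0.2},
}
ranked = ranked_expansions("PQ", char_to_categories)
```

Documents found by the first entry of the ranked list would
be offered to the user first.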

Next, a second variation of the first embodiment of the 
document processing apparatus according to the present 
invention is described. So far it is supposed that there is no 
error in the phase of character segmentation and each 
character is securely segmented, but in fact, a lot of errors 
occur in the character segmentation. In the case where a 
document includes only Japanese characters, a fixed pitch is
expected. However, in the case where some English words or
the like are possibly included in it, the Chinese character
is often separated incorrectly into a left-hand radical and
a right-hand radical if the text is written laterally.
Needless to say, one character may also be divided into two
by reason of blurring of character images caused in
scanning.

If there are plural possible character segmentation
positions in some characters, representative character code
strings each of which includes one of the possible
segmentation results may be represented. Supposing such a
case, representation of the representative character code
string as follows is now considered. This is realized by
extending the representative character code table described
in the first embodiment in the following way.

FIGS. 16(A) and 16(B) illustrate a specific example of the 
segmentation positions in the case where the plural inter- 
pretations of character segmentation are possible. Now it is 
assumed that the image to be the object of character
segmentation process is the four-character string shown in
FIG. 16(A). As to the first and second characters, the
characters are properly segmented because there is no
segmentation position other than the space between the
characters. However, the third character has one candidate
for segmentation position which vertically consists of white
pixels only, and the fourth character has two such
candidates. There is of course a segmentation position
between these two characters, and consequently five partial
characters in total (a1, a2, b1, b2, b3) can be obtained
from the third and fourth characters as shown in FIG. 16(B).

Integration of the above partial characters into one
character is attempted. In the integration, the partial
characters are processed from the left, wherein any item
which does not exceed the threshold value of the width is
regarded as a character. As the threshold value of the
width, for example, the height h of the line can be used. In
this example, there is nothing which can be integrated with
the first character; therefore, it is directly registered as
one character. The same is the case with the second
character. Regarding the third character, two
interpretations are possible: one is the case where the
partial characters a1 and a2 are regarded as two characters;
the other is the case where they are regarded as one
character. Integration of a2 with b1 is not accepted
because, if they are integrated, the result exceeds the
threshold value of the width. Therefore, it is necessary to
retain both interpretations for the same character image
region. In the same way, for b1 and the partial characters
subsequent thereto, four interpretations ([b1], [b2], [b3]),
([b1b2], [b3]), ([b1], [b2b3]) and ([b1b2b3]) are possible.
Here, square brackets indicate that the partial characters
inside are regarded as one character.

FIG. 17 illustrates relations between the character strings 
segmented in the case where plural interpretations of char- 
acter segmentation are possible. That is, relations among the 
possible interpretations in attempting the integration of 
partial characters as described above are shown. In the 
figure, O indicates a pause of character segmentation, and 
□ indicates a unit to be regarded as one character. As 
mentioned above, a1 and a2 have two interpretations and
b1-b3 have four interpretations, and the candidates
segmented in accordance with each interpretation are
arranged and connected by a line. In this example, eight
interpretations in total are possible and all of them are
retained.
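The count of eight interpretations can be checked with a
small enumeration. This is only a sketch under assumptions:
the function name interpretations is hypothetical, each
partial character is given as (name, left, right) with
invented coordinates, and the width threshold is taken as
the line height h=40. The embodiment itself does not
enumerate the interpretations explicitly but stores them
compactly in sub-tables.

```python
def interpretations(parts, max_width):
    """All groupings of adjacent partial characters into characters.

    parts: list of (name, left, right), sorted from left to right.
    """
    if not parts:
        return [[]]
    results = []
    left = parts[0][1]
    for end in range(1, len(parts) + 1):
        # A single partial character is always accepted; a wider grouping
        # is rejected as soon as it exceeds the width threshold.
        if end > 1 and parts[end - 1][2] - left > max_width:
            break
        head = "".join(name for name, _l, _r in parts[:end])
        for rest in interpretations(parts[end:], max_width):
            results.append([head] + rest)
    return results

parts = [("a1", 0, 14), ("a2", 16, 40),
         ("b1", 44, 64), ("b2", 66, 74), ("b3", 76, 84)]
result = interpretations(parts, max_width=40)
```

With these coordinates a2 and b1 together span 48 pixels, so
they are never integrated, which leaves 2 x 4 = 8 groupings.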

FIG. 18 illustrates an example of the representative
character code table in the case where plural
interpretations of segmentation are permitted. To represent
the plural interpretations as shown in FIG. 17, the
representative character code table is divided into a master
table and sub-tables. The master table is made by extending
the representative character code table shown in FIG. 13: if
there are plural interpretations of character segmentation,
pointers to the sub-tables representing the interpretations
are stored in the column which previously indicated the
position on the image. If plural interpretations are
possible, the representative character code in the master
table is set to zero in FIG. 18. Each sub-table consists of
the character regions beginning at a certain segmentation
position, their positions in the image, and the number of
the sub-table connecting thereto.

Regarding the character divided into a1 and a2 shown in FIG.
16(B), the character segmentation positions are at the left
of the partial character a1 and the left of the partial
character a2. The numbers are assigned to the sub-tables
from the leftmost segmentation position in order. That is,
if it is assumed that the left of a1 is the segmentation
position, the possible interpretations as one character are
[a1] and [a1a2]. Because [a1] is followed by the
segmentation position at the left of a2, the sub-table
number 2 is stored for [a1]. There are no more characters
subsequent to [a1a2]; accordingly, 0 is stored for [a1a2].

Next, the second sub-table is generated for the case where 
the left of a2 is assumed to be the segmentation position. 
There is only one interpretation as a character in the right 
side of the segmentation position, namely, [a2]. Therefore, 
only [a2] is registered at the second sub-table, and the next 
table number is set to 0 because there is nothing subsequent 
thereto. 

In the same way, three sub-tables are generated for the
character divided into b1, b2 and b3. For the first
sub-table, three interpretations [b1], [b1b2] and [b1b2b3]
are generated. For the second sub-table, [b2] and [b2b3] are
generated, and for the third sub-table, [b3] is generated.
The pseudo character recognition process is carried out for
each segmented character, and the representative character
code is assigned thereto and stored in the column of
representative character code in the sub-table. In FIG. 18,
the representative character code assigned to each extracted
character is represented by the braces { }.
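The linkage between sub-tables can be sketched as follows.
The sketch is an assumption-laden simplification: the
function name build_sub_tables and the predicate can_merge
are hypothetical, and the image position and the
representative character code of each row are omitted so
that only the region and the next table number remain.

```python
def build_sub_tables(parts, can_merge):
    """Build the sub-tables of one group of partial characters.

    parts: names of the partial characters, left to right.
    can_merge(i, j): True if parts[i..j] fit within one character width.
    Returns {table_number: [(region, next_table_number), ...]}, 1-based.
    """
    tables = {}
    for start in range(len(parts)):
        n = start + 1                       # sub-table number
        rows = []
        for end in range(start, len(parts)):
            if end > start and not can_merge(start, end):
                break
            region = "".join(parts[start:end + 1])
            # Next table = n + number of partial characters in the region,
            # or 0 when nothing remains to the right.
            nxt = n + (end - start + 1) if end + 1 < len(parts) else 0
            rows.append((region, nxt))
        tables[n] = rows
    return tables

always = lambda i, j: True
tables_a = build_sub_tables(["a1", "a2"], always)
tables_b = build_sub_tables(["b1", "b2", "b3"], always)
```

Following the next table numbers from sub-table 1 until 0 is
reached reproduces each interpretation of the group.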

FIG. 19 is a flow chart showing an example of the process of
generating the representative character code table in the
case where plural interpretations of segmentation are
permitted, as shown in FIG. 18. At first, in step 81,
initial values are set. The k partial character regions
included in a line are assumed to be p_1, p_2, ..., p_k, and
a variable L is set to the list {p_1, p_2, ..., p_k}. It is
assumed that the k partial character regions are sorted from
the left to the right. The flag S indicating whether or not
there are plural interpretations of segmentation now under
processing is set to FALSE. Moreover, the list C of partial
character regions possible to be integrated into one
character is made to be empty. A variable n indicating the
current sub-table number, a variable m indicating the
position of the partial character string in the course of
integration and a variable i indicating the position of the
partial character region currently focused are all set to 1.

In step 82, it is examined whether the position of the
partial character region which is currently focused reaches
the end of the line or not. That is, i and k are compared,
and if i≤k, the process proceeds to step 83, where the
unprocessed leftmost partial character region p_i is
segmented and set on the list C. In step 84, the integration
of the partial character region p_i, or of an integrated
partial character region including p_i, with the
right-adjacent partial character region p_(m+1) is supposed,
and the width of the character is calculated assuming that
the integration is carried out. In step 85, it is determined
whether the calculated width exceeds the threshold value or
not. If the width does not exceed the threshold value,
further integration is possible; accordingly, the flag S is
set to TRUE in step 86, p_(m+1) is added to the list C, the
variable m is incremented by one and the process returns to
step 82. In this case, since only the value of the variable
m is changed without changing the value of the variable i,
the integration of the partial character region adjacent
further to the right is attempted in step 84. In such a way,
the process is repeated until the character width exceeds
the threshold value. In step 85, if the integrated character
width exceeds the threshold value, the integration attempted
last in step 84 is not carried out, and the process proceeds
to step 87. Here, p_i to p_m can be integrated. The list of
partial character regions which can be integrated {p_i, ...,
p_m} up to this point is
stored in the list C.

In step 87, it is determined whether the element in the list
C is only p_i or not, that is, whether plural partial
character regions can be integrated or not. If the elements
in the list C are not only p_i, plural partial character
regions can be integrated, and accordingly, sub-tables are
generated based on those partial character regions. In step
88, all possible character regions including the leftmost
partial character region are registered at the sub-table of
number n in order of the number of partial character regions
from the smaller one to the larger one. At this time, the
size of each integrated character region is normalized, the
feature is calculated, and the representative character code
is assigned thereto and registered at the sub-table. The
next table number is determined by adding the number of
partial character regions in the integrated character region
to the value of the variable n, and the next table number of
the last integrated character region in the sub-table is set
to 0. Thus the sub-tables for the integrated character
regions beginning with the i-th partial character region are
generated.

In step 89, the variable i is incremented by one to get
ready for the process on the integrated character regions
beginning with the next partial character region, and
thereby the focus is shifted to the next partial character
region. At the same time, the list C is reset to be empty
and the variable n indicating the sub-table number is
incremented by one, and moreover, the variable m is set to
the value of the variable i. Then the process returns to
step 82, and the integration of the partial character
regions is attempted from the next partial character region.

In step 87, if the element of the list C is only p_i, the
flag S is further examined in step 90. If the flag S is
FALSE, p_i is a partial character region which may be an
independent character. In step 91, the size of the partial
character region p_i is normalized, its feature is
calculated, and the representative character code is
assigned thereto and registered at the master table. After
that, the variable i is incremented by one and the list C is
reset to be empty for the process of the next partial
character region. Then the process returns to step 82.

If the flag S is TRUE in step 90, the partial character
region p_i is, for example, the rightmost partial character
region in a group of partial character regions which can be
integrated, such as a2 or b3 in the example shown in FIG.
16(B). In this case, in step 93, p_i is normalized, its
feature is calculated, and the representative character code
is assigned thereto to generate the n-th sub-table. At this
time, the next table number is 0. Because the partial
character region p_i has no partial character region to the
right to be integrated, the sub-tables derived from one
entry of the master table are completed. Therefore, in step
94, the variable n indicating the sub-table number is reset
to 1. The variable i is incremented by one for processing
the next partial character region, the list C is reset to be
empty and the flag S is reset to FALSE. The variable m is
set to i. The process returns to step 82 so that the process
is started from the partial character region which is newly
focused.

When the process has proceeded to the rightmost partial
character region in the line and is completed, the relation of
i and k is i>k. If it is determined in step 82 that this condition
is satisfied, no further integration process is necessary. In step
95, it is determined whether the list C is empty or not. If it
is not empty, the processes in step 87 and the subsequent
steps are executed on the partial character regions left in the
list C, and the master table or sub-tables are generated. When
the list C becomes empty, the processes are completed. By such
processes, for example, the representative character code
table having a two-stage structure as shown in FIG. 18 is
generated. The generated representative character code table
is registered together with the inputted document image.

If the plural interpretations of character segmentation are 
permitted, the bi-gram table which is an index for retrieval 
is extended so that it can deal with the plural interpretations 
of the character segmentation. That is, as to the two char- 
acters in the bi-gram table, it is necessary to explicitly notify 
whether they are one of the plural interpretations of char- 
acter segmentation or not, and if so, to which interpretation 
of character segmentation they belong. Therefore, the 
bi-gram table is extended as follows: in the table of position 
in the document image stored for each bi-gram table shown 
in FIG. 12, the document ID and the block ID are left 
unchanged because they are used in common, but the
position of each of the first and second characters is repre- 
sented by a set of (p, n, m). These p, n and m represent a 
character position in a block, namely a position in a repre- 
sentative character code table, the sub-table number corre- 
sponding to an interpretation of character segmentation and 
a position in the sub-table, respectively. 
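
By way of illustration only, the extended position information can be sketched as a small data structure. The names CharPos, Occurrence, BigramTable and register, as well as the document ID and block ID values, are assumptions made for this sketch and do not appear in the embodiment; only the meaning of (p, n, m) is taken from the description above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CharPos:
    p: int  # position in the representative character code table of the block
    n: int  # sub-table number for one interpretation of the segmentation
    m: int  # position in that sub-table


@dataclass(frozen=True)
class Occurrence:
    doc_id: int      # document ID, unchanged from the basic bi-gram table
    block_id: int    # character block ID, unchanged from the basic table
    first: CharPos   # position of the first character of the bi-gram
    second: CharPos  # position of the second character of the bi-gram


# Maps a pair of representative character codes to all of its occurrences.
BigramTable = Dict[Tuple[str, str], List[Occurrence]]


def register(table: BigramTable, code1: str, code2: str, occ: Occurrence) -> None:
    """Append one occurrence of the bi-gram (code1, code2) to the table."""
    table.setdefault((code1, code2), []).append(occ)


# Hypothetical entry: codes "A" and "B" stand in for representative codes.
table: BigramTable = {}
register(table, "A", "B", Occurrence(1, 7, CharPos(116, 1, 1), CharPos(116, 2, 1)))
```

Keeping the document ID and block ID outside the per-character triple mirrors the statement that those two fields are used in common by both characters of the bi-gram.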

FIG. 20 illustrates an example of the bi-gram table in the
case where the plural interpretations of character segmentation
are permitted. If there is only one interpretation of
segmentation, n is set to 0 and m is ignored. This is
applicable to the example of the bi-gram " in FIG. 20.
If there are plural interpretations of segmentation and the
individual character in the bi-gram is one of them, n and m
represent the sub-table number and the position in the
sub-table, respectively. In the example of plural candidates
of character segmentation of " Qim" shown in FIG. 16, for
instance, the character positions corresponding to the
bi-gram {[a1]} {[a2]} generated by dividing the character
"ft" into two parts are (116, 1, 1), (116, 2, 1), and the
character positions of the bi-gram {[a1a2]} {[b1b2b3]}
generated by correctly segmenting " en" and are stored as
(116, 1, 2), (117, 1, 3). Thus the bi-gram table generated
based on the representative character code strings in the
inputted document image is registered to be used in the
retrieval.

When the retrieval is executed, it is possible to generate
the bi-grams of the representative character code string for
the keyword in the inputted retrieval expression in the same
way as in the case of the document image, and to retrieve them
from the registered bi-gram table. Because the keyword is
inputted by the keyboard 3, for example, the retrieval executing
element 13 receives it as character codes, and accordingly,
there are no plural interpretations of segmentation position,
and only one segmentation position is determined. In the
bi-gram table generated from the document image, a bi-gram
for the case of correctly segmented characters is also
registered; therefore, a match with such a bi-gram of the correct
segmentation case is detected in the retrieval.
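
As a minimal sketch of the keyword side of the retrieval, the conversion of a keyword into bi-grams of representative character codes might look as follows. The function name keyword_bigrams and the toy mapping char_to_rep are assumptions made for illustration; in the apparatus the mapping would come from the character code-category correspondence table.

```python
from typing import Dict, List, Tuple


def keyword_bigrams(keyword: str, char_to_rep: Dict[str, str]) -> List[Tuple[str, str]]:
    """Convert each keyword character to its representative code and
    return the successive pairs of codes (the bi-grams)."""
    codes = [char_to_rep[ch] for ch in keyword]
    return [(codes[i], codes[i + 1]) for i in range(len(codes) - 1)]


# Toy mapping (hypothetical): "e" and "c" are assumed to share a category.
toy_map = {"c": "c", "e": "c", "a": "a", "t": "t"}
bigrams = keyword_bigrams("eat", toy_map)
```

Because the keyword arrives as character codes, each character maps to exactly one code, so the keyword yields a single bi-gram sequence with no alternative segmentations.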

As described above, for a keyword having three
characters or more, it is necessary to determine whether the
bi-grams exist successively in the same document or not.
To determine this for two bi-grams, it should be examined
whether they have the same document ID and the same
character block ID, and whether (p, n, m) representing the
position of the last character of the former bi-gram is the
same as the position of the first character of the bi-gram
that is to be determined to follow immediately. If the two
bi-grams have the same document ID and the same character
block ID, and moreover, (p, n, m) of the last character of the
former bi-gram is the same as the position of the first
character of the latter bi-gram, they are determined to be
successive.
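
The succession test described above can be sketched as follows. The names Occ, normalize and are_successive are hypothetical; the normalization step reflects the earlier statement that m is ignored when n is 0, which is an interpretation made for this sketch rather than a step recited in the embodiment.

```python
from typing import NamedTuple, Tuple

Pos = Tuple[int, int, int]  # (p, n, m)


class Occ(NamedTuple):
    doc_id: int    # document ID
    block_id: int  # character block ID
    first: Pos     # position of the first character of the bi-gram
    second: Pos    # position of the second (last) character of the bi-gram


def normalize(pos: Pos) -> Pos:
    """Ignore m when there is only one segmentation interpretation (n == 0)."""
    p, n, m = pos
    return (p, 0, 0) if n == 0 else (p, n, m)


def are_successive(former: Occ, latter: Occ) -> bool:
    """True if the latter bi-gram immediately follows the former one."""
    return (
        former.doc_id == latter.doc_id
        and former.block_id == latter.block_id
        and normalize(former.second) == normalize(latter.first)
    )
```

For a three-character keyword, two bi-gram occurrences found in the table would be accepted only when are_successive returns True for them.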

It is possible to combine the construction in the case
where classification into plural categories is permitted as
described in the first variation and the construction in the
case where plural interpretations of character segmentation
exist as described in the second variation.
Second Embodiment 

Next, the second embodiment of the document processing
apparatus according to the present invention is explained. As
described above, in the first embodiment, there is a possibility
of retrieving a character string which is not acceptable
as a word in a document, because the character string to be
searched is converted into a string of similar character
categories and the retrieval is executed by simple
matching. The object of the second embodiment is to
prevent the retrieval of a document including such a character
string unacceptable as a word and to improve the precision of
the retrieval.

FIG. 21 shows a construction of the second embodiment
of the document processing apparatus according to the
present invention. Elements corresponding to elements of
the first embodiment have the same reference numbers as
those of the first embodiment, and the explanations are
omitted.

An image inputting element 101 may be the scanner 5
shown in FIG. 1, for example, and reads the document as an
image. An image displaying element 102 may be the display
2 shown in FIG. 1, for example, and displays the inputted
image or confirmation of a result of the process. A similar
character classifying element 103 is the same as the similar
character classifying element 11 in FIG. 1, and classifies
object characters into categories consisting of similar
characters based on the image feature. A text region extracting
element 104 constitutes the function of a part of the pseudo
character recognizing element 12 in FIG. 1, which extracts
a text region in the document image, and further extracts the
image of every character. A pseudo character recognizing
element 105 constitutes the function of another part of the
pseudo character recognizing element 12 in FIG. 1, which
classifies the character images into the appropriate similar
character categories and assigns representative character
codes thereto. A category word detecting element 106
extracts a representative character code string which
constitutes a word from the group of representative character
code strings. A category word converting element 107
converts the category word into characters. A CPU 108
controls the whole apparatus.

A storage device 109 includes the external storage device
7 shown in FIG. 1 and also holds a character category
retaining element 111, a pseudo character recognition result
storing element 112, a category word dictionary 113 and a
code conversion table 114. The CPU 108 stores the program
for controlling the whole apparatus. The character category
retaining element 111 stores the categories classified by the
similar character classifying element 103 and the corresponding
image features. For example, the character category retaining
element 111 stores the similar character category table,
character code-category correspondence table and so forth.
The pseudo character recognition result storing element 112
retains the representative character code string converted by
the pseudo character recognizing element 105. The category
word dictionary 113 stores at least the representative character
code strings constituting words and the correspondence
between each word and its part of speech. In some
cases, a character word consisting of at least one character
represented by the representative character code string is
also retained. Moreover, a part of speech connection dictionary
which shows the relation of connection between the words
is also retained. The code conversion table 114 records the
correspondence between the representative character code
string representing a word and a character string. In the case
where the character words are retained corresponding to the
category words in the category word dictionary, it is possible
to replace the code conversion table 114 with the category
word dictionary 113.

The details of each process are described as follows. The
process in the similar character classifying element 103 is
the same as that of the first embodiment; therefore the
explanation is omitted here. The similar character category
table and the character code-category correspondence table
generated by the similar character classifying element 103
are retained in the character category retaining element 111.
It is unnecessary for the similar character classifying element
to execute parsing for every process once the feature
to be parsed is determined; accordingly, it is possible to
execute parsing in another device and to store only the result
in the character category storing element 111 to be
used. The character category storing element 111 stores the
similar character category table, for example, specifically, as
shown in FIG. 6 and the character code-category correspondence
table, for example, as shown in FIG. 7.

The category word dictionary 113 and the code conversion
table 114 can be generated by replacing the character
codes of the conventional word dictionary with the
representative character code string by utilizing the similar
character category table and the character code-category
correspondence table stored in the character category storing
element 111. FIG. 22 illustrates an example of the category
word dictionary in the second embodiment of the document 
processing apparatus according to the present invention. In 
this example, the representative character code string indi- 
cating a word, part of speech of the word indicated by the 
representative character code string, and a character string of 
the word indicated by the representative character code 
string are related to one another. Some of the conventional 
dictionaries store a character word and the corresponding 
part of speech in a pair. In such a case, by obtaining the 
representative character code string corresponding to the 
character word and rearranging them, the category word 
dictionary 113 such as shown in FIG. 22 is available. In the 
category word dictionary 113, for the representative char- 
acter code string indicating a word, particularly, as to a word 
which conjugates, not only a stem of the word but also a 
suffix is stored separately. Further, as described later, the 
category word dictionary 113 includes a part of speech 
connection dictionary which shows the relation of connec- 
tion between the stem and the suffix, and moreover, an 
auxiliary verb, a postpositional particle or the like subse- 
quent thereto. Otherwise, it may be possible to store all 
conjugation forms. 
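
The generation of the category word dictionary from a conventional dictionary can be sketched as follows. The function name build_category_dictionary, the toy mapping and the English toy words are assumptions made purely for illustration; the embodiment operates on Japanese words and on the tables of FIGS. 6 and 7.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_category_dictionary(
    word_dict: List[Tuple[str, str]],  # (character word, part of speech)
    char_to_rep: Dict[str, str],       # character -> representative code
) -> Dict[str, List[Tuple[str, str]]]:
    """Re-key a conventional word dictionary by representative code string.

    Several character words may collapse onto one representative code
    string, so each key maps to a list of (part of speech, word) pairs.
    """
    category_dict: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for word, part_of_speech in word_dict:
        key = "".join(char_to_rep[ch] for ch in word)
        category_dict[key].append((part_of_speech, word))
    return dict(category_dict)


# Toy data (hypothetical): "c"/"e" and "a"/"o" are assumed to look alike.
toy_map = {"c": "c", "e": "c", "a": "a", "o": "a", "t": "t", "r": "r"}
toy_words = [("cat", "noun"), ("eat", "verb"), ("car", "noun")]
category_dict = build_category_dictionary(toy_words, toy_map)
```

In this toy case "cat" and "eat" share one entry, which illustrates why one category word may carry plural parts of speech.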

FIG. 23 illustrates another example of the category word
dictionary in the second embodiment of the document
processing apparatus according to the present invention. The
category word dictionary 113 may be represented in various
forms other than the form of the table showing the relation
among the representative character code string, the word
represented in the character code and the corresponding part of
speech as shown in FIG. 22. For example, the category word
dictionary 113 can be constructed in the form shown in FIG.
23 so as to efficiently execute the matching process. The
category word dictionary 113, for example, uses the "trie"
which is introduced by "Trie and its Application", Aoe,
Information Processing, Vol. 34, No. 2, February 1993, and
constructs the trie for all category words beginning with
each representative character code. The trie has the
structure such that the category word can be extracted by
traversing the nodes to the terminal node. In FIG. 23, the
terminal nodes are represented by the symbols ©.

In the example shown in FIG. 23, the category word
dictionary 113 is illustrated, which can be used for matching the
following seven words, " X*'\ " ?t&\ " X^ y \ " X* W^", 
" Xit", " *ttit g", and " *<tm*". The representative char- 
acter code strings generated by converting the seven words 
are as follows: " X*"; " " a X&m*"; " Xit"; 

" Xit&B" and " Xik®*". These strings are represented by 
a trie as shown in FIG. 23. The characters in a character string
are matched one by one from the top of the string against the
category word dictionary 113, and a character string that
reaches the terminal symbol © is accepted as a
word and outputted. FIG. 23 shows the trie which executes
matching for the above seven words only, but actually, the
trie is generated by converting all words into the
representative character code strings to be the category word
dictionary 113. The corresponding information such as parts
of speech or the character words may be connected to the
terminal symbols. Otherwise, the table as shown in FIG. 22
and the trie dictionary as shown in FIG. 23 may be held
together. Needless to say, a category word dictionary 113
having another data structure may be used.
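
A trie of this kind can be sketched with nested dictionaries. The marker TERMINAL and the functions build_trie and is_word are assumptions made for this sketch; TERMINAL merely stands in for the terminal symbol of FIG. 23, and the Latin letters stand in for representative character codes.

```python
from typing import Dict, Iterable

TERMINAL = "$"  # assumed stand-in for the terminal symbol in FIG. 23

Trie = Dict[str, dict]


def build_trie(code_strings: Iterable[str]) -> Trie:
    """Build a trie in which each complete word ends at a TERMINAL child."""
    root: Trie = {}
    for codes in code_strings:
        node = root
        for code in codes:
            node = node.setdefault(code, {})
        node[TERMINAL] = {}  # mark that a word ends at this node
    return root


def is_word(trie: Trie, codes: str) -> bool:
    """Match codes one by one from the top; accept if TERMINAL is reached."""
    node = trie
    for code in codes:
        if code not in node:
            return False
        node = node[code]
    return TERMINAL in node


trie = build_trie(["ab", "abc", "ad"])
```

Parts of speech or character words could be attached by storing them under the TERMINAL key instead of an empty dictionary.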

FIG. 24 illustrates an example of the code conversion
table of the second embodiment of the document processing
apparatus according to the present invention. The code
conversion table 114 pairs the representative character code
string with the word corresponding thereto and stores them.
Here, the information about the part of speech is also added.
The data held in the category word dictionary 113 and the
code conversion table 114 are almost the same; therefore, the
data are shared in the actual processing. However, here the
data are dealt with separately to simplify the explanation.

The processes described above are carried out for preparation
of the data necessary to segment the words from the
document image. Accordingly, it may be possible to carry
out the above processes by another apparatus and prepare
the similar character category table, character code-category
correspondence table, category word dictionary 113 and the
code conversion table 114 in advance, and thereby only the
relevant data may be used.

Now the document registration process is described. The
text region extracting element 104 parses the binary-value
digital image inputted through the image inputting element
101 to segment the character region, and moreover segments
each character. The process of the text region extracting
element 104 is the same as a part of the pseudo character
recognizing element 12 in the first embodiment, namely,
processes in steps 51 and 52 in FIG. 8; therefore the
explanation is omitted here.

The pseudo character recognizing element 105 executes
the process for every character region segmented by the text
region extracting element 104. The process in the pseudo
character recognizing element 105 is the same as a part of
the pseudo character recognizing element 12 in the first
embodiment, namely, processes in step 53 and the steps
subsequent thereto in FIG. 8, except for the registration
process to the bi-gram table in step 54.

The pseudo character recognizing element 105 converts
each character image segmented by the text region
extracting element 104 into the representative character code
of the similar character category. This process is the same as
that shown in FIG. 10; accordingly, the explanation is
omitted. The obtained representative character code strings
are stored in the pseudo character recognition result storing
element 112 per character block together with the position in
the image and the inputted image. For example, as shown in
FIG. 13, the representative character code and the position
of the rectangle which it occupies in the image, represented by
(top-left x-coordinate, top-left y-coordinate, width, height),
can be stored.

The category word detecting element 106 executes matching
of the representative character code strings stored in the
pseudo character recognition result storing element 112 by
the pseudo character recognizing element 105 with the
category word dictionary 113, and extracts the representative
character code strings which are accepted as words.
FIGS. 25 and 26 are flow charts showing an example of
operation of the category word detecting element in the
second embodiment of the document processing apparatus
according to the present invention. Here it is assumed that
the category word dictionary 113 has the data structure of
the trie shown in FIG. 23.

First, in step 121, the punctuation marks detected by the
pseudo character recognizing element 105 are again detected
from the category character string, and a representative
character code string from the character at the top of the line
to a punctuation mark, or a representative character code
string between punctuation marks, is regarded as one
processing unit, whereby the representative character code
strings stored in the pseudo character recognition result
storing element 112 are divided into plural processing units.
Then the divided processing units are processed in order.
In step 122, it is determined whether any unprocessed
processing units are left or not, and if all processing units
have already been processed, the process in the category
word detecting element 106 is completed. If there are any
processing units left unprocessed, in step 123, one of the
unprocessed processing units is specified, the value of
the variable N is set to the number of characters in the unit,
and the values of both of the variables I and J are set to 1.
The variable I indicates the processing object character in
the processing unit, and the variable J indicates the hierarchy
of the node in the trie in the category word dictionary 113.
In step 124, the value of the variable I is substituted for the
variable P, the value of the variable T is set to 1, and a
region BUFFER is cleared to be NULL. The variable P
indicates the position of the character at which the word
detection is newly started in the specified processing unit,
and the variable T indicates the number of characters of the
newly detected word. The detected words are stored in the
region BUFFER in order.

In step 125, it is determined whether all characters in
the specified processing unit have been processed or not by
examining whether the variable I is within the range of the
variable N or not. If there are any characters which are
unprocessed, an attempt to match the I-th character in
the specified processing unit with all nodes which have a
possibility of connection to the (I-1)-th character in the
processing unit among all nodes in the J-th hierarchy in the
category word dictionary 113 is executed in step 127. The
terminal symbol is regarded as a "wild card" which matches
with all characters. In step 128, it is determined whether a
matching character exists or not. If there is no matching
character, the character string up to that point is not
accepted as a word, and consequently, the matching attempt is
executed again, starting after the position at
which the preceding word matching was started. That is, in
step 132, the number of characters newly detected as a word
and stored in the variable T is added to the position of the
character at which the word detection was started, stored in the
variable P, whereby the position of the character at which the
next matching is started is calculated and substituted for the
variable I. Further, in step 134, the words having been
detected and stored in the region BUFFER are transferred to
be stored in the storage device 109, and in step 135, the
variable J is set to 1 so that the matching is started from the
top of the category word dictionary 113, and then the process
returns to step 124. In step 124, the value of the variable I
is substituted for the variable P to avoid newly starting word
detection from the same position. Thus the process is
continued to newly detect a word.

If the characters attempted to match in step 128 exist as
the nodes having possibilities of connection to the (I-1)-th
character in the processing unit in the J-th hierarchy of the
category word dictionary 113, it is further determined in step
129 whether the detected matching characters include the
terminal symbol or not. If the terminal symbol is included,
there is a possibility that a word exists before the character
position indicated by the variable I; therefore, in step 130,
the detected word is stored in the region BUFFER and the
word length is stored in the variable T.

The number of characters matched in the matching
attempt in step 127 is not limited to one: for example, there
is a case where both a certain character and a terminal
symbol are matched. In step 131, it is determined whether
the matching characters are only the terminal symbols or
not, and if they are only the terminal symbols, there is no






longer matching candidate in the category word
dictionary 113. Therefore, in step 134, the words having
been detected and stored in the region BUFFER up to this 
point are transferred and stored in the storage device 109, 
and in step 135, the value of the variable J is set to 1 for 
detecting a new word. Then the hierarchy is reset to the top 
of the category word dictionary 113, and the process returns 
to step 124. In step 124, the value of the variable I is 
substituted for the variable P to avoid newly starting detec- 
tion of a word from the same position. Thus the process 
continues to newly detect a word. 

If it is determined in step 129 that the matching characters 
do not include the terminal symbol, or, if it is determined in 
step 131 that the matching characters are not only the 
terminal symbols, each value of the variables I and J is 
incremented by one to execute next character matching in 
step 133, and then the process returns to step 125. 

By repeating such processes, whenever a terminal symbol 
appears, a word is detected and stored in the storage device 
109. In step 125, it is confirmed that the processes for all 
characters in the specified unit are completed, and the words 
stored in the region BUFFER are transferred and stored in 
the storage device 109, thus the processes for the processing 
unit are completed. In the case it is determined that there is 
a processing unit left unprocessed, the unprocessed unit is 
selected and the matching attempt is carried out for the 
characters one by one as described above for detecting the 
words. If the processes are completed for all of the processing
units, the process in the category word detecting element
106 is completed.
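
The detection flow described above can be sketched, in simplified form, as follows. This is not a step-by-step transcription of FIGS. 25 and 26: the names TERMINAL, build_trie, split_units and detect_words are assumptions made for this sketch, the trie is a nested dictionary rather than the structure of FIG. 23, and 0-based indices replace the variables I, J, P and T (the local names p and t play the roles of P and T).

```python
from typing import Dict, Iterable, List

TERMINAL = "$"  # assumed stand-in for the terminal symbol


def build_trie(words: Iterable[str]) -> Dict[str, dict]:
    root: Dict[str, dict] = {}
    for word in words:
        node = root
        for code in word:
            node = node.setdefault(code, {})
        node[TERMINAL] = {}
    return root


def split_units(codes: str, punctuation: str) -> List[str]:
    """Roughly step 121: cut the code string into units at punctuation."""
    units, current = [], ""
    for code in codes:
        if code in punctuation:
            if current:
                units.append(current)
            current = ""
        else:
            current += code
    if current:
        units.append(current)
    return units


def detect_words(unit: str, trie: Dict[str, dict]) -> List[str]:
    """Detect category words in one processing unit."""
    stored: List[str] = []      # plays the role of the storage device
    p = 0                       # start position of the current detection
    while p < len(unit):
        t = 1                   # advance by one if no word is detected
        buffer: List[str] = []  # words detected from position p, in order
        node, i = trie, p
        while True:
            if i > p and TERMINAL in node:
                buffer.append(unit[p:i])  # a word ends before position i
                t = i - p
            if i >= len(unit) or unit[i] not in node:
                break                     # no longer match is possible
            node = node[unit[i]]
            i += 1
        stored.extend(buffer)   # transfer BUFFER to storage
        p += t                  # next start position = P + T
    return stored


trie = build_trie(["ab", "abc", "d"])
```

With this toy dictionary, every word ending at a terminal marker along the path is recorded, and the next detection starts after the longest one, which corresponds to adding T to P in the description.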



As a specific example, the case of matching of the
representative character code string " Xit&s?" as the
processing unit by utilizing the trie shown in FIG. 23 is now
considered. At first, matching of " X" is attempted and it
succeeds, and then next matching of " it" is carried out.
Then attempt of matching is carried out as to the
representative character codes " and " it" in the second
hierarchy of the trie, all of which have possibilities of
connection to ". As a result, " ft" succeeds in matching.
Since the matching representative character code is not the
terminal symbol, the matching of the next representative
character code " £" is further attempted. That is, among the
representative character codes in the third hierarchy, the
matching with " " @" and the terminal symbols which
have possibilities of connection to " ft" is executed. In this
case, the terminal symbol and " are matched. Because
the terminal symbol is included, " Xik" is detected as a word
and stored in the region BUFFER. However, the
representative character code is not only the terminal symbol, and
therefore the matching process is continued. The matching
of the next representative character code with "M" in
the representative character codes in the fourth hierarchy,
which has a possibility of connection to ", is attempted.
However, since the representative character codes do not
match with each other, the word "3C<fc*" in the region
BUFFER is transferred and stored in the storage device 109.

The next word matching is started with " which is
subsequent to the detected category word "Kft". Such
processes are continued to the last character of the
processing unit, and further executed until no unprocessed
unit is left. According to these processes, all category words
which exist in the category word dictionary 113 and appear
in the document can be stored in the storage device 109.

Generally, the same words appear plural times in a
document, and consequently, the same category words are
redundantly stored in the storage device 109. The redundant
category words may be retained, or may be deleted except
one. The pseudo character recognition result storing element
112 stores the representative character codes and information
of positions thereof in the image for the case where the
position of appearance of the word is desired to be known.
However, if redundancy is to be excluded, the pseudo
character recognition result storing element 112 may be
constructed so that plural pieces of positional information
are stored corresponding to one word. As a method of
displaying the positions of appearance of words by utilizing
the positional information, various publicly known techniques
can be adopted; therefore detailed explanation
is omitted here.

According to the processes described so far, the category
words represented by the representative character codes can
be extracted. However, those processes are merely detection
of the words entered in the word dictionary, and there is no
guarantee that the extracted words are accepted as
well-formed Japanese words. For example, a compound noun
might possibly be extracted by incorrectly separating at a
position which is different from the border between the
original nouns, or a word accompanied by an incorrect suffix
or postpositional particle might be extracted. Such
ill-formedness in the particular language (Japanese) is
corrected by verifying the possibility of connection between the
words in view of part of speech.

For example, if a Japanese sentence
"- > $ ft* & • * -C* S" is represented by a representative
character code string, it turns to be, for instance,
"***fi3*i**5irr**0". If the detection of the category
words as described above is also carried out for a
processing unit specified therefrom " ^jg. 3 ft* *s ffi~Ci> £>",
for example, a category word " Stft" is detected from the
category word dictionary 113, and and "fiS" are
further detected. The category word includes the
character words " i£ifl" and "3". " £(sa)" includes
"$b(sa)" (a suffix of the sa-row irregular verb, a detailed
part of speech in Japanese particular grammar) and " &(ki)"
(a stem of the lower Indian verb, also a detailed part of
speech in Japanese particular grammar). However, taking
the context into consideration, it is incorrect that a verb
having a stem " follows the noun " nm" or
according to the Japanese particular grammar. Besides, the
noun " is never followed by a causative auxiliary verb
according to the Japanese particular grammar, too.
Therefore, the interpretation for combination of the words
and is correct. Similarly, the combination of
"3" and is also correct. In this case, actual parts of
speech are the sa-row irregular verb "gM + Z" and the
causative auxiliary verb " ft 4".

The errors in the word extraction as described above also
occur in parsing of the ordinary character string, but it can
be said that such errors occur more frequently in dealing with
the representative character code string, which has more
indefiniteness. Therefore, the precision of word extraction can be
improved by verifying the possibility of connection to the
previously detected word whenever a new word is detected.

For the verification as described above, the part of speech
connection dictionary stored in the category word dictionary
113 can be used. FIG. 27 illustrates an example of the part
of speech connection dictionary in the second embodiment
of the document processing apparatus according to the
present invention. The part of speech connection dictionary
shown in FIG. 27 indicates the relation of connection
between the parts of speech of two successive words. Parts
of speech of the former word are entered in the row and
those of the latter word are entered in the column of a table.
The values of the table indicate:
1: the part of speech in the row can be followed by the part
of speech in the column;
0: the part of speech in the row cannot be followed by the
part of speech in the column.
Whenever a category word is detected, the relation of
connection between the words is verified by utilizing the
part of speech connection dictionary as shown in FIG. 27,
for example.

However, the category word has a possibility of representing
plural character words by one representative character
code string. Accordingly, in the actual processing, the
verification of possibility of connection is executed regarding
all parts of speech of the plural character words corresponding
to the representative character code string
extracted as a word. If the possibility of connection is
verified for only one part of speech, the representative
character code string is accepted as a word.

FIG. 28 is a flow chart showing an example of verification
process for the relation of part of speech connection. In this
process, whenever the category word is detected by the
category word detecting element 106, it is inputted and the
possibility of connection between words is verified in view
of part of speech. First, in step 141, the category word first
detected in the processing unit is inputted and substituted for
a variable WORD1. In step 142, it is examined whether any
part of speech available to the category word is able to be the
top of the clause or not. If the category word does not
include a word of a part of speech able to be the top of the
clause, the category word cannot be acceptable as Japanese;
therefore, the representative character code string is rejected
as a word.

If it is determined in step 142 that the category word can
be the top of the clause, the next category word is inputted
from the processing unit in step 143 and stored in a variable
WORD2. In step 144, the possibility of connection between
two category words stored in the variables WORD1 and
WORD2 is obtained by searching the part of speech
connection dictionary as shown in FIG. 27. If there is no

detecting element 106 is carried out again to extract another
candidate of the word.

The flow of the above-described processes is now
described with a specific example. With the previously used
example "#5fc ^ % & s ^ ft ^ & Q", how the parts of
speech in the clause are determined is
explained. First, the characters in the representative character
code string are attempted to be matched
with the category word dictionary 113 from the top in order,
and a representative character code string having a
possibility of a word is obtained. According to the contents
of the category word dictionary 113 shown in FIG. 22, there
is a possibility that this representative character code is any
of the words as follows: " gST (noun), " (noun) and
" mm" (a stem of the sa-row irregular verb). The representative
character code string next detected is " $" having a
possibility of being " " (a suffix of the sa-row irregular
verb, indicating mizen form) or " " (a stem of the lower
Indian verb). With reference to the part of speech connection
dictionary shown in FIG. 27, neither relation of connection
(noun)-(a suffix of the sa-row irregular verb) nor
(noun)-(a stem of the lower Indian verb) exists; therefore,
at this point, the representative character code string
" " having a possibility of a word turns out to be " "
(a stem of the sa-row irregular verb). Accordingly, the
representative character code string " £ia" is accepted as the
category word.

The representative character code string detected next is
" ^" (auxiliary verb), which is capable of connecting to a
suffix indicating the mizen form according to the part of
speech connection dictionary shown in FIG. 27. Therefore,
the representative character code string " *" is determined to
be a suffix of the sa-row irregular verb and accepted as the
category word. Further, the representative character code
string " K" is also accepted as the category word because
the auxiliary verb can be the end of a clause with reference
to the part of speech connection dictionary shown in FIG.
27. Thereby the representative character code string
"«zS.£*t&" is accepted as one clause, and a stem of the
sa-row irregular verb " is detected as an independent

combination of parts of speech which has the relation of word (namely, a word which can constitute a clause by 

connection therebetween among all parts of speech available 45 itself). 

to the two category words, the first category word stored in If the segmentation positions of the words are incorrect 

the variable WORD1 cannot be accepted as Japanese, and and thereby the verification of the possibility of connection 

consequently the first word is rejected. If there is any in view of part of speech is impossible, the process returns 

combination of parts of speech which has the relation of to the top of the clause now under processing, and the 

connection therebetween among all combinations of parts of 50 verification of the possibility of connection in view of part 

speech available to the two category words, the first category of speech is executed again after the change of the segmen- 

word stored in the variable WORD1 is accepted as a tation positions. According to the above method, the words 

well-formed word. Further, in step 146, the category word can be extracted while keeping the correct border between 

stored in the variable WORD2 is transferred to the variable the original words in a compound noun or a possibility of 

WORD1. In step 147, it is determined whether the process 55 connection between the words in well-formedness as a 

has reached the end of the processing unit or not, and if the particular language. 

process has not reached the end, the process returns to step By the processes as described above, the category words 

143, where the rest of the category words are inputted in having the relation of connection at least accepted as Japa- 

order for verification of the possibilities of connection nese can be detected. Next, the accepted category words are 

between words in view of parts of speech in the same way. 60 converted into the words constituted by corresponding char- 

The category word accepted in step 145 will be able to be acters. This process is carried out in the category character 

processed more precisely by the category word converting converting element 107 by utilizing the code conversion 

element 107 later by storing with which part of speech the table 114. The process is very simple. The category words 

category word is accepted in step 145 in the storage device accepted as the words are searched in the code conversion 

109. If a category word is rejected as a word, the process 65 table 114 and all available character words corresponding to 

returns to the top character of the clause now under each of the category words are outputted. However, because 

processing, and the word extraction by the category word the words used for searching are the independent words, 
only the words regarded as the independent words in the preceding verification process of the relation of connection in view of part of speech are outputted. Thus the words effective to the retrieval are obtained, and moreover, it is possible to restrain the number of words to be outputted.

As described so far, word extraction from the document image can be carried out without using a process that requires as much computation cost as the usual character recognition process. Because the verification of the relation of connection in view of part of speech has been executed, ill-formed words are not included in the extracted words. Therefore, if these words are used for retrieval, it is expected that the result of retrieval will have high precision. In this second embodiment, the representative character code string is converted into the character code string by the category word converting element 107, and therefore the retrieval can be executed directly by the character code string without converting the keyword in the retrieval expression into the representative character code string as shown in the first embodiment.

Now a first variation of the second embodiment of the document processing apparatus according to the present invention is described. Similar to the first variation of the first embodiment, the precision of conversion of the character image into the representative character code by the pseudo character recognizing element 105 is improved in this example. In the above-described example, the first variation of the first embodiment, the minimum distance identification method, which selects the nearest features in the feature space for assigning the representative character code of each similar character to the character image as shown in step 65 in FIG. 10, is used. However, in many cases, the actual features of the character image vary by blurring or distortion of the image, and thereby the clusters of the similar characters complexly overlap with each other. In such a case, the minimum distance identification method has a strong possibility of identification error.

FIG. 29 illustrates an example of identification error of the representative character code. For example, a case where two clusters a1 and a2 exist in a certain space of two-dimensional features as shown in FIG. 29 is considered. An unknown character x originally belongs to the cluster a1. However, according to the minimum distance identification method, the unknown character x is determined to belong to the nearest cluster a2. Such identification error occurs as well in the case where the two clusters overlap with each other and the feature of the unknown character x exists in the common portion of the two clusters.

For resolving the identification error problem, in the first variation of the first embodiment described above, one character type is registered to the plural similar character categories by utilizing the ε-component extension method. If the category word dictionary 113 is generated based on the similar character category table prepared according to the above-described way, plural different representative character code strings represent the same single character word because one character type belongs to the plural categories. For example, if the character "[…]" is registered to the category "[…]", and the character "[…]" is registered to the categories "[…]" and "[…]", the word "[…]" is represented by two representative character code strings "[…]" and "[…]". In this way, if one character word corresponds to the plural different representative character code strings, the size of the category word dictionary 113 is increased as a result. The increase of the dictionary size of the category word dictionary 113 causes a complexity in the construction of the category word dictionary 113, and besides, has a bad influence on the word extraction speed.

Therefore the similar character classifying element 103 uses the minimum distance identification method here to generate the similar character category table 41 and the category word dictionary 113 in the same way as before, but in the identification process in the pseudo character recognizing element 105, the Euclidean distance between the feature of the inputted character image and the category representative vector of each similar character category is calculated, and the categories having distances within a range from the minimum to the N-th one are regarded as the character categories of the inputted characters, and thereby the representative character codes are outputted. However, a threshold value Dt is set to the distance so that the character categories which have a distance longer than the threshold value Dt are not adopted as character categories of the inputted characters; thus it is possible to limit the character categories corresponding to one character type.

FIGS. 30(A) and (B) illustrate an example of conversion into the representative character code string in the case where N=2 in the first variation of the second embodiment of the document processing apparatus according to the present invention. As the example, a case where the character string "[…]" is converted into the representative character code string by the above-described method is taken. Here, it is assumed that N=2. Moreover, it is also assumed that only the similar character category located at the minimum distance within the threshold value Dt exists for the character "[…]".

The representative character code string converted in the case of N=1, namely, by the minimum distance identification method, is "[…]". For example, it is assumed that the character "[…]" is not included in the category of the representative character code "[…]" of the third character. In this case, the character string "[…]" cannot be reproduced from the representative character code string "[…]".

Next, the representative character code string corresponding to N=2, namely, the category which has the next minimum distance within the threshold value Dt, is obtained; thereby, for the characters "[…]", "[…]", "[…]", "[…]" and "[…]", the representative character codes "[…]", "[…]", "[…]", "[…]" and "[…]" are obtained, respectively. If the character "[…]" is included in the category of the representative character code "[…]" obtained as described above, it becomes possible to reproduce the character string "[…]".

Next, the category words are detected by the category word detecting element 106 from the category string constituting one or more representative character codes obtained for one character. The process in the category word detecting element 106 attempts to match all representative character codes with the category word dictionary 113 and stores the representative character codes accepted as the words in the storage device 109 […] the above-described method. That is, the representative character code "[…]" or "[…]" is adopted as the first character, and then the matching is attempted as to whether the representative character code "[…]" or "[…]" is subsequent thereto as the second character or not. In this way, the matching attempt is continued until the terminal symbol is detected. If the terminal symbol is detected, the representative character code strings which have been detected are stored in the storage device 109 as the category words. In the course of the process, plural category character strings are generated, but at the point where no subsequent character exists, the representative character string may be rejected.
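As an illustration only, the selection of up to N categories within the threshold value Dt and the matching of the resulting candidates against the category word dictionary could be sketched in Python as follows. The function names, the nested-dictionary trie, the end-of-word sentinel standing in for the terminal symbol, and the single-letter representative codes are assumptions made for this sketch; they do not appear in the specification.

```python
import math

_END = object()  # assumed stand-in for the terminal symbol of the dictionary


def nearest_categories(feature, representatives, n, dt):
    """Return up to n representative codes, nearest first, within distance dt."""
    scored = sorted(
        (math.dist(feature, vector), code)
        for code, vector in representatives.items()
    )
    # Categories farther away than the threshold Dt are not adopted.
    return [code for distance, code in scored[:n] if distance <= dt]


def build_trie(words):
    """Store each representative character code string in a nested dict."""
    root = {}
    for word in words:
        node = root
        for code in word:
            node = node.setdefault(code, {})
        node[_END] = True  # a word ends at this node
    return root


def match_category_words(lattice, trie):
    """Return every dictionary word readable from the start of the lattice.

    lattice[k] lists the candidate codes obtained for the k-th character.
    """
    accepted = []
    frontier = [("", trie)]
    for candidates in lattice:
        # Extend every surviving string by every candidate code that the
        # dictionary allows at this position.
        frontier = [
            (prefix + code, node[code])
            for prefix, node in frontier
            for code in candidates
            if code in node
        ]
        if not frontier:
            break  # no candidate string can be extended any further
        accepted.extend(prefix for prefix, node in frontier if _END in node)
    return accepted


representatives = {"A": (0.0, 0.0), "B": (3.0, 4.0), "C": (10.0, 0.0)}
print(nearest_categories((0.0, 0.0), representatives, n=2, dt=6.0))  # ['A', 'B']

trie = build_trie(["AC", "BCE", "BCEF"])
lattice = [["A", "B"], ["C"], ["D", "E"], ["F"]]
print(match_category_words(lattice, trie))  # ['AC', 'BCE', 'BCEF']
```

Keeping all partial strings in a frontier list mirrors the statement that plural category character strings are generated in the course of the process and dropped once no subsequent character matches.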
For example, in the matching process up to the second character, it is assumed that there are three word candidates "[…]", "[…]" and "[…]". Here, the terminal symbol is detected in the category word dictionary 113, and as a result, two representative character code strings "[…]" and "[…]" are assumed to be accepted as the words. The representative character code subsequent thereto is "[…]" or "[…]", but if there is no word in the connection of "[…]" or "[…]" in the word dictionary for matching, matching of the word candidates beginning with "[…]" will not be executed in the later matching process. In the next matching of the fourth character "[…]", if there is no word in the connection of "[…]"-"[…]" or "[…]" in the word dictionary for matching, the word candidates beginning with the category "[…]" are rejected in the later matching process because there have been no representative character code strings beginning with the category "[…]" accepted as a word as the result of matching with the terminal symbol. The process is further continued to carry out matching of the seventh character, and if the character subsequent to the representative character code string "[…]" is only the terminal symbol, the representative character code string "[…]" is accepted as a word.

The words accepted as words by the word dictionary for matching and beginning with the first character "[…]" or "[…]" are "[…]" and "[…]". Here, by adopting the longest-match principle, only the representative character code string detected as a longer word "[…]" is left as the word candidate and outputted. Further, as described above, whether the outputted word can be accepted as a word or not is examined by matching with the part of speech connection dictionary, and as a result, only the category word strings acceptable as words are outputted.

Thus it becomes possible to extract words with more precision by relating plural similar character categories to one character image. In this way, errors in the pseudo character recognition that occur for a certain character in the selection of the category characters by the minimum distance identification, caused by the change of the feature of the character due to blurring or distortion of the character image, can be minimized by selecting the plural character categories which have similar distances in the feature space.

A second variation of the second embodiment of the document processing apparatus according to the present invention is now described. As described above, in the second embodiment and its first variation, the words acceptable as the particular language can be extracted from the document image without executing the detailed identification process for all character types. However, the word has been extracted as a combination of the similar character categories so far; accordingly, there remains indefiniteness, and in some cases plural character words correspond to the representative character code string extracted as one word. For instance, two character words "[…]" and "[…]" correspond to the category word "[…]" accepted as a noun. With the above-described construction, the two words "[…]" and "[…]" are extracted as the independent words in the document image, and it is impossible to determine which of them is really described in the document image.

To resolve such a problem, in the second variation, the feature of each character is examined in detail to uniquely determine a character. In this case, it is unnecessary to execute comparison of features of approximately 3,000 character types such as in the conventional character recognition, and it is sufficient to execute comparison of the features of the characters used in the character words corresponding to the category words detected in the category word detecting element 106. For example, if the detected category word can be interpreted as three character words, comparison of the features with three characters may be executed at each character position in the detailed identification process.

FIG. 31 shows the construction of the second variation of the second embodiment of the document processing apparatus according to the present invention. In the figure, elements corresponding to elements in FIG. 21 have the same reference numbers as those in FIG. 21, and the explanations are omitted. A detail identifying element 110 extracts detailed features of an inputted unknown character and compares them with the features of the characters in the similar character category, and thereby the character type is uniquely determined. A detailed identification dictionary 115 stores detailed features of the character image for every similar character category.

The detail identifying element 110 and the detailed identification dictionary 115 are now explained further. The detailed identification dictionary 115 is generated by utilizing the similar character category table which is the result of classification into similar characters by the similar character classifying element 103. As the features necessary to generate the detailed identification dictionary 115, the features used in the conventional character recognition apparatus can be used. FIGS. 32(A)-(E) illustrate an example of features used for generating the detailed identification dictionary in the second variation of the second embodiment of the document processing apparatus according to the present invention. As the features to be used, for example, the features utilizing a directional attribute disclosed in Japanese Patent Application Laid-Open No. 5-166008 (1993) can be applied. The features are obtained by measuring the continuity of pixels in plural directions regarding outlining pixels in the character image, which represents directions or complexity of lines constituting a character. In the example of FIG. 32(A), the number of pixels indicating the continuity of pixels in each of the directions of left-to-right, top-to-bottom, top-left-to-bottom-right and top-right-to-bottom-left is counted regarding the outlining pixels of the character image "[…]", and the direction having the maximum counted value of the number of pixels is determined to be the directional attribute of the pixels. If the outlining pixels having the maximum counted value in the left-to-right direction are collected, the features shown in FIG. 32(B) can be obtained. Similarly, if the outlining pixels having the maximum counted value of the pixels in the top-to-bottom direction are collected, the features shown in FIG. 32(C), in the top-left-to-bottom-right direction, the features shown in FIG. 32(D), and in the top-right-to-bottom-left direction, the features shown in FIG. 32(E) can be obtained respectively. Such features of the directional attribute should be stored in the detailed identification dictionary.

A feature of outline direction contribution degree disclosed in "Handprinted Chinese Characters Recognition by Peripheral Direction Contributivity Feature", Hagita et al., The Transactions of the Institute of Electronics, Information and Communication Engineers D, Vol. J66-D, No. 10, pp. 1185-1192, October, 1983 may be adopted. Being different from the peripheral feature used in the similar character classifying element 103 indicating the shape of the character, each of these features indicates complexity, direction and continuity of lines inside of a character, whereby
more detailed features of the character are represented. Of course, other features or a combination of plural kinds of features may be used.

FIG. 33 is a flow chart showing an example of procedures of preparation of the detailed identification dictionary in the second variation of the second embodiment of the document processing apparatus according to the present invention. Here, the feature amount to be used is represented as the detailed feature, and is not described as a specific feature amount. First, in step 151, one similar character category is selected from the similar character category table. Next, in step 152, an image representing the character type belonging to the selected similar character category is extracted from the image in the training sample. In step 153, the detailed feature is extracted from each character type in the character image extracted in step 152, and in step 154, an average of the detailed features is calculated. In step 155, grouping of the feature amounts for every similar character category is carried out and the grouped feature amounts are added to the detailed identification dictionary 115. By executing these processes for each similar character category, the detailed identification dictionary is generated.

FIG. 34 illustrates an example of the detailed identification dictionary in the second variation of the second embodiment of the document processing apparatus according to the present invention. The detailed identification dictionary 115 may be constituted, for every similar character category, by the character code belonging thereto and its detailed feature vectors, as shown in FIG. 34, for example. Similar to the similar character category table, character code-category correspondence table, category word dictionary 113 or code conversion table 114, the detailed identification dictionary 115 can also be implemented in another device in advance and constructed so as to use its own data only.

As described above, in the second variation, the category word acceptable as a particular language is extracted from the representative character code string, and finally the character word is obtained by the category word converting element 107. At this time, there are some cases where conversion of one category word into plural character words is possible. In such a case, the detail identifying element 110 is called to identify each character image in detail, and thereby a character code is uniquely determined, that is, a character word is determined.

In the detail identifying element 110, the character word is determined according to the processes as follows. Now it is assumed that the detected category word is Sc and the length of the category word Sc is L(Sc). It is also assumed that the number of character words into which the category word Sc can be converted is N and the n-th word candidate is Swn. The order is assigned to the word candidates, but it has no specific meaning such that a candidate with a smaller number has a better possibility to be a word, and it is simply assigned according to the order of the dictionary as a matter of convenience. It is further assumed that the i-th character in the character word Swn is represented as Swn(i). Then a character word which makes the value An of the following equation minimum is outputted as an ultimate result of the category word converting element:

\[ A_n = \sum_{i=1}^{L(S_c)} F\bigl(X_i,\ Sw_n(i)\bigr) \]

wherein F(X, M) is the difference between the features of an inputted unknown character X and a certain character M, and X_i denotes the i-th unknown character of the category word.

As is understood from the equation, it is actually unnecessary in the detail identifying element 110 to carry out comparison of features with those of all character types in each category, and it is sufficient to compare features only with those of the character types in the word candidates having possibilities of the word. Even in the worst case, the number of comparisons for one category is the number of similar characters in the category.

The accumulated value of the differences between the features is used because, if the detailed identification of each character image were carried out individually and a word were generated by combining the characters each of which has the best certainty in its category, a word other than the word candidates (in some cases, a word unacceptable as a particular language) could be generated under the influence of blurring or distortion of the character image. The words detected in the category word detecting element 106 are at least acceptable as the particular language, and therefore it is possible to make only the word candidates detected in the category word detecting element 106 the object of identification.

FIG. 35 is a flow chart showing an example of the process in the detail identifying element in the second variation of the second embodiment of the document processing apparatus according to the present invention. The example of the process in the detail identifying element 110 described above is explained based on FIG. 35. First, in step 161, a category word Sc to be a processing object is selected and the number of candidates of the character word N corresponding to this category word Sc is obtained. The length of the category word Sc, L(Sc), is determined to be W. Moreover, a storage region A[N] used for processing is secured and initialized, and simultaneously, the variable i is set to 1 by initialization. At this point, if the number of candidates of the character word N is 1, the category word is excluded from the processing object and directly converted into the character word by the category word converting element 107. The category word Sc determined to be the processing object is segmented in the document image. It is possible to obtain the position of the processing object category word Sc in the document image by storing the positional information when the category word is segmented by the category word detecting element 106, and referring thereto.

Next, in step 162, the character image of the i-th character is segmented. The position of each character image in the category word can be obtained by storing the positional information simultaneously when each character image is assigned to the character category in the pseudo character recognizing element 105, and referring thereto. From the character image segmented in this way, the same features as those used for generating the detailed identification dictionary 115 are extracted in step 163, which are assumed to be the features X. In steps 164-167, the extracted features and the detailed features of the i-th character of each word candidate are compared, and the differences are accumulated in the storage region A for every word candidate. That is, in step 164, the variable j is set to 1, and in step 165, the difference F(X, Swj(i)) between the features X extracted in step 163 and the detailed features Swj(i) of the i-th character is calculated and accumulated in A[j]. In step 166, the variable j is incremented by one. In step 167, it is determined whether the value of the variable j exceeds the number of character word candidates N or not, and the process returns to step 165 to be continued until the value of j exceeds the number N. Thereby the differences of the features of characters from the first one to the i-th one are accumulated in the storage region A[1]-A[N].
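As an illustration only, the accumulation of feature differences for each word candidate and the selection of the candidate with the minimum accumulated value An could be sketched in Python as follows. The function names, the use of the Euclidean distance as the difference F(X, M), and the toy two-dimensional feature vectors are assumptions made for this sketch; the specification leaves the detailed feature and the difference function open.

```python
import math


def feature_difference(x, m):
    """F(X, M); the Euclidean distance is assumed here for illustration."""
    return math.dist(x, m)


def select_character_word(char_features, candidates, detailed_dict):
    """Return the word candidate whose accumulated difference A_n is smallest.

    char_features[i] holds the detailed features X of the i-th character image,
    candidates holds the N character words for one category word, and
    detailed_dict maps each character to its detailed feature vector.
    """
    if len(candidates) == 1:
        return candidates[0]  # a single candidate needs no detailed comparison
    accumulated = [0.0] * len(candidates)  # plays the role of A[1]..A[N]
    for i, x in enumerate(char_features):  # one pass per character position
        for j, word in enumerate(candidates):  # one comparison per candidate
            accumulated[j] += feature_difference(x, detailed_dict[word[i]])
    best = min(range(len(candidates)), key=accumulated.__getitem__)
    return candidates[best]


detailed_dict = {"a": (0.0, 0.0), "b": (1.0, 1.0), "c": (5.0, 5.0)}
observed = [(0.0, 0.0), (4.5, 4.5)]  # features of two segmented character images
print(select_character_word(observed, ["ab", "ac"], detailed_dict))  # ac
```

Only the characters that occur in the word candidates are compared, so the cost per character position is bounded by the number of candidates rather than by the number of character types.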
Further, in step 168, the variable i is incremented by one. In step 169, the value of the variable i is compared with the length W of the category word, and if i is not more than W, the process returns to step 162 to be continued. In this way, the processes in steps 162-169 are repeated to the last character, and thereby the accumulation value of differences of features of each character is stored in the storage region A[1]-A[N] for every word candidate.

In step 170, the values of the storage regions A[1]-A[N] are compared with one another, and an address C of the storage region having the minimum value is obtained. In step 171, the word candidate SwC corresponding to the address C is extracted, and outputted as a character word which has the best certainty.

Here, the example is shown in which the accumulation value of the differences between the features of the unknown characters and those in the dictionary is used as the evaluation function of the word, but it is also possible to obtain the certainty of the unknown characters statistically by utilizing the statistical information such as the distribution of the detailed features of the training sample obtained in generating the dictionary, and make the accumulation of the values the evaluation function of the word.

As described above, if the category word detected in the category word detecting element 106 can be converted into plural character words, precise word extraction is available by executing the detailed identification for the detected category words. Furthermore, it is assured that the words acceptable as a particular language can be detected in the category word detecting element 106 by limiting the object of the detailed identification to the combination of characters of the word candidates.

Next, a third variation of the second embodiment of the document processing apparatus according to the present invention is described. In each of the above examples of the second embodiment, it is assumed that there is no error in the process of character segmentation. However, actually, many errors occur in the process of segmentation as described in the second variation of the first embodiment. In the third variation of the second embodiment, an example coping with such errors occurring in segmentation is shown. As in the second variation of the first embodiment, the example shown in FIGS. 16(A) and 16(B) is also taken as an example here.

FIG. 36 illustrates an example of relations between the segmented character strings in the third variation of the second embodiment of the document processing apparatus according to the present invention. As described above, in the case of the example of "[…]" shown in FIG. 16(A), there are only the spaces between the characters as to "[…]" and "[…]". But there is one candidate of segmentation posi-

width is exceeded; therefore the integration is not executed. Accordingly, two possible preceding interpretations are required to be retained for the same character image region. As a result of similar character identification about each of the interpretations in the pseudo character recognizing element 105, the partial characters a1, a2 and the partial characters a1a2 are converted into the representative character codes "[…]", "[…]" and "[…]", respectively, and stored in the storage device 109. In FIG. 36, the representative character codes are shown in round brackets, and the symbols […] in the figure are pauses of interpretation of character segmentation.

Regarding b1 and the partial characters subsequent thereto, similarly, there are four possible interpretations ([b1], [b2], [b3]), ([b1b2], [b3]), ([b1], [b2b3]) and ([b1b2b3]), wherein the partial characters in the square brackets [ ] are regarded as one character. Therefore, the processes in the pseudo character recognizing element 105 are carried out in the same way. The combinations of partial characters regarded as one character, [b1], [b2], [b3], [b1b2], [b2b3] and [b1b2b3], are converted into the representative character codes "[…]", "[…]", "[…]", "[…]", "[…]" and "[…]", respectively, and all of these interpretations are stored in the storage device 109.

The representative character codes corresponding to "[…]" obtained as mentioned above are represented as "[…]" here. The contents in the square brackets [ ] can represent the plural interpretations of segmentation position, if any, within a specific range in the character image. For example, two vertical strokes in the right side of "[…]" can be represented so as to be regarded as one representative character code, as well as two representative character codes.

In the case where there are plural interpretations of segmentation when the category word dictionary 113 is searched, it is examined whether the representative character code string of each range of the interpretations exists in the category word dictionary 113 or not, and all that have possibilities are left. In the above example, regarding the character "[…]", it is firstly examined whether the representative character code strings "[…]" and "[…]" exist in the category word dictionary 113 or not. If both of them are determined to exist, they are retained as those having possibilities of existence. Next, as to the character "[…]", the candidates of the representative character code which follows each of "[…]" and "[…]" are "[…]", "[…]" and "[…]", and it is examined whether these candidates are capable of connecting thereto or not by using the category word dictionary 113. "[…]" is […] of the candidates and becomes a word by itself. The representative

...... u » i - ■ 7. character code stnng "®m exists in the category word 

tion in the cbaracter fSfl only consisting of the white pixels dictionary 113 and extracted as a word; accordingly, the 

in the vertical direction, and in the character "BT, there are candidates of word to follow are similarly attempted to 

two candidates of such segmentation position. Needless to match with it, and further the possibilities of connection are 

say, there is a segmentation position between these two 55 checked by the part of speech connection dictionary. The 

characters; consequently, total five partial characters (al, a2, interpretation "EF* is regarded as one word, and the possi- 

bl, b2, b3) are obtained. Integration of these partial char- bility of connection of the word beginning with the next 

acters into complete characters is now attempted. Because character is examined. Here, the character categories having 

nothing can be integrated with the characters " and " 6Q the possibility of connection are " JR", "31" and U #J". The 

they are identified with the similar character categories in words beginning with each of the three categories are 

the pseudo character recognizing element 105 and converted extracted and relations of connection in view of part of 

into the representative character codes and " 5C" and " g", speech are examined. If there is no such word connecting to 

respectively. There are two possibilities of interpretation of "EF', the possibility of interpretation of "EF* is rejected and 

the character " en**: the partial characters al and a2 are dealt 65 " M" is left. 

with as two characters; and they are regarded as one char- As a more complex example, a character string 

acter. If a2 and bl are integrated, the threshold value of the "NMRtC fclt a" is now considered. Here, the characters 
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"N", "M" and "R" are half pitch. Consequently, there is a in step 187. If it is determined to be succeeded, whether the 

fear that these adjacent English characters are incorrectly end of the representative character code is the border of a 

integrated with each other and incorrectly recognized as a word or not is further determined in step 188. If it is not the 

Chinese character. Besides, there is a segmentation position border of a word, the process returns to step 182 and 

in a character 5 matching with the category word dictionary is continued 

FIG. 37 is another example of relations between the u ^il a word can be extracted. If the matching up to the 

segmented character strings in the third variation of the border of a word is succeeded, and thereby a candidate of the 

second embodiment of the document processing apparatus category word is obtained, relation of connection regarding 

according to the present invention. The integration can be P^ 1 of speech to the category word candidate immediately 

assumed as follows: "NM"; "MR" and "R" phis the left side 10 obtained is examined to determine whether it is possible to 

stroke of "tc" as one character. It is further assumed that connect with each other in step 189. If possible, it is left. In 

"WIT, "a" and "51" are obtained as the representative StCp 19 °' P^cess completes to the end of the 

character codes corresponding to the three integrated char- * Th " * P ^ Vm 

r & & is still in the course of the processing unit, the process 

acters. For the character " t:", it is assumed that a represen- 15 retliros to 182 to extract the next category word and the 

tative character code and "1" and " for each of process is continued. If the process is completed to the end 

separated partial characters are obtained. Then the represen- of the processing unit, the string of the category words 

tative character code strings which permit plural interpreta- obtained so far can be" a candidate of a word; therefore the 

uon of character segmentation are represented as [NIM[R obtained category word string is outputted in step 191. 

\Z f [mcl]]], [fll[R[[^, 1], 31^]]]. In the case of 20 If it is determined in step 187 that the matching with the 

actual matching, category transition data can be prepared by category word dictionary in step 186 failed, the process 

regarding a representative character code as a node and a returns to the hierarchy higher by one layer than before for 

connection between representative character codes where reference, where plural interpretations exist, and matching 

transition is possible as an arc based on the plural interpre- otner P atas from sle P 185 are also executed. In the case 

tations of segmentation shown in the square brackets [ ] in 25 where it is determined in step 189 that connection of parts 

the notation of the representative character code strings. °f speech is not permitted, also, possibilities of the word 

FIG. 37 shows the category transition data which makes strings supposed hitherto are rejected and the transitions 

partofcharacterstring tt NMR^j3"anobjcaThcmatehing subsec l uent to«<*<> a « excluded from the object of 

with the category word dictionary 113 is executed from the processing, and thereby processes are not carried out. Then, 

top of the category transition data. For example, it is 30 m ste P 1W > the P rocess returns to the hierarcfl y by 

tU . u>k rxj fT> „ , v u m „ , v j , 0Iie l aver before, where plural interpretations exist, and 

assumed that NMR (noun), JSl (noun) and fa stem c . \ oc f . . ^ 

, „ w , v _ nt . ,\ , .V . • » . 1j • \u processes from step 185 are continued in the same way. 

oi a verb) are attempted to be matched as the words in the r- r <, . , , / 

' , .... * hi n ur*- t !r V Furthermore, even after the processes are completed to the 

category word dicUonary 113. Possibilities of connection to A * * u • j . j 

♦u u t , 3 , . , . . „ , end of the processing unit and the category word string is 

the subsequent words are checked by the part of speech , c n „ tn „„ aA ;„ i0 r <u a ♦ * mi iu 

j.^. n i -i • i. j K 35 outputted in step 191, the process proceeds to step 193 for 

connection dictionary. For example, if it is turned out that a * • >- c .l ■ L ' 

J * * determmation of other possibilities, in which the process 

"NMR" can be connected by "ic" (postpositional particle returns t0 the hierarchy higher by one layer than before 

assigning the case), the category word " IIL" cannot be where plural interpretations exist, and the process proceeds 

connected by a word beginning with "R" or "31" and the to ste P 185 and is continued. If another possible category 

word - Mr cannot be connected by a word beginning with 40 word strin S » obtained > * * also outputted in step 191, of 

course 

the representative character code u Z ". Then, as a result, the fe »u * j . .t. • 

r w , « there remains no unprocessed transition path m the 

candidates of the word " 01" and "fL3l" are rejected, and variable P in step 185, it is determined in step 192 whether 

"NMR u" still remains as the candidate. Thus the candidate all transitions derived from the top of the hierarchy are 

of the correct segmentation position is left. 45 examined or not. If there is any transition which is 

FIGS. 38 and 39 are flow charts showing an example of unexamined, the process proceeds to step 193 to make the 

the integration process of the segmented character strings in hierarchy higher by one layer than before where plural 

the third variation of the second embodiment of the docu- interpretations exist, and in step 185, unprocessed transition 

ment processing apparatus according to the present inven- path is sought and the process is continued. If the processes 

tion. At first, a representative character code string which is 50 for all transitions from the top hierarchy are completed, it 

a processing unit for the process in the pseudo character means that the processes for all paths in the category 

recognizing element 105 is developed to the category tran- transition data provided to one processing unit are com- 

sition data as described above. In step 181, the process is pleted. Thus the integration process is finished, 

started with the first position of the processing unit as a As described so far, though in the case there are divided 

focused point. 55 characters and plural category word candidates are found, it 

In step 182, it is determined whether there are plural is possible to gradually reduce the possibilities of the words 

transition paths to the next representative character code or in view of the relation of connection of parts of speech, and 

not, and if there are any, the process proceeds to lower thereby word extraction with high speed and with high 

hierarchy by one to refer to in step 183. In step 184, available precision becomes available . 

transition paths in the hierarchy currently referred to are set 60 Each of the above-described embodiments can also be 

to a variable P. implemented by a computer program. In such a case, it is 

In step 185, it is determined whether there are unproc- possible to store the program, a dictionary used by the 

essed transition paths in the variable Por not, and if there are program, table or the like in a storage medium readable by 

any unprocessed transition paths, one of them is focused and the computer. The storage medium is able to communicate 

the representative character code which is ahead of the 65 the contents described in the computer, which is in a form of 

transition path is matched with the category word dictionary a signal corresponding to the condition of transformation of 

in step 186. The matching is succeeded or not is determined energy of magnetism, optics, electricity or the like caused in 
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accordance with the contents described in the computer, to 
a reading device equipped with hardware resources of the 
computer. The storage medium may be, for example, a 
magnetic disk, optical disk, CD-ROM, or a memory built-in 
the computer. 
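The integration process of FIGS. 38 and 39 can be illustrated with a minimal sketch. It is not the implementation of the flow charts: recursion stands in for the explicit hierarchy and the variable P, and the names `extract_word_strings`, `arcs`, `word_dict` and `connectable`, as well as the toy codes "A", "B", "C" and "X", are assumptions made only for illustration. The category transition data is modelled as arcs between segmentation positions, the category word dictionary as a mapping from code sequences to a part of speech, and the part of speech connection dictionary as a set of permitted pairs.

```python
from typing import Dict, List, Set, Tuple

# Category transition data: position -> list of (representative code, next position).
# Several arcs leaving one position model plural segmentation interpretations.
Arcs = Dict[int, List[Tuple[str, int]]]
Word = Tuple[str, ...]


def extract_word_strings(
    arcs: Arcs,
    end: int,
    word_dict: Dict[Word, str],
    connectable: Set[Tuple[str, str]],
) -> List[List[str]]:
    """Enumerate category word strings covering positions 0..end (hypothetical helper)."""
    results: List[List[str]] = []

    def is_prefix(codes: Word) -> bool:
        # True while at least one dictionary word starts with these codes.
        return any(word[: len(codes)] == codes for word in word_dict)

    def walk(pos: int, current: Word, words: List[Tuple[str, str]]) -> None:
        if pos == end:
            if not current:  # the unit ended exactly on a word border
                results.append([text for text, _ in words])
            return
        for code, nxt in arcs.get(pos, []):
            codes = current + (code,)
            if not is_prefix(codes):
                continue  # dictionary matching failed: try another path
            if codes in word_dict:
                tag = word_dict[codes]
                # Keep the word only if it may follow the previous part of speech.
                if not words or (words[-1][1], tag) in connectable:
                    walk(nxt, (), words + [("".join(codes), tag)])
            walk(nxt, codes, words)  # also try to extend to a longer word

    walk(0, (), [])
    return results


# Toy data: "X" is the interpretation that merges the first two partial characters.
arcs = {0: [("A", 1), ("X", 2)], 1: [("B", 2)], 2: [("C", 3)]}
word_dict = {("A",): "noun", ("X",): "noun", ("B", "C"): "verb", ("C",): "particle"}
connectable = {("noun", "particle")}
word_strings = extract_word_strings(arcs, 3, word_dict, connectable)
```

With this toy data only the path "X" followed by "C" survives, because the noun-to-verb connection needed by the other segmentation is not permitted; rejecting a path as soon as the dictionary or the connection check fails is what keeps the search small.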

As is clear from the above description, according to the
present invention, full-text retrieval can be realized merely by
classifying the characters in the document image into a
small number of similar character categories, without
identifying those characters with the character code strings. The
identification of the similar character categories in the
present invention is executed by adopting feature vectors whose
number of dimensions is far smaller than that used in conventional
character recognition. Therefore, since it is sufficient to
identify the characters with the small number of similar
character categories, it is possible to greatly accelerate
both the extraction of the independent words which can be
used as keywords from the document image and the
registration of the document image.
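As a hedged illustration of classification into similar character categories, the following sketch assigns a low-dimensional feature vector to the category whose representative vector is nearest. The two-dimensional features, the category names and the function names are assumptions for illustration only; the description does not fix a particular distance measure, and Euclidean distance is used here as a stand-in.

```python
import math
from typing import Dict, List, Sequence


def nearest_category(
    feature: Sequence[float],
    representatives: Dict[str, Sequence[float]],
) -> str:
    """Return the category whose representative feature vector is closest."""
    return min(representatives, key=lambda c: math.dist(feature, representatives[c]))


def pseudo_recognize(
    features: List[Sequence[float]],
    representatives: Dict[str, Sequence[float]],
) -> List[str]:
    """Convert segmented character features into a category string."""
    return [nearest_category(f, representatives) for f in features]


# Hypothetical representative features for two similar character categories.
representatives = {"C1": (0.1, 0.9), "C2": (0.8, 0.2)}
category_string = pseudo_recognize([(0.7, 0.3), (0.2, 0.8), (0.9, 0.1)], representatives)
```

Because only the category is decided, the comparison runs against a small number of representative vectors rather than against every character in a recognition dictionary.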
The similar character categories are retained as an
attribute of the original document image, and when the
retrieval is to be executed, each of the characters of the
retrieval keyword is converted into the string of the similar
character categories to be used for retrieval. Because plural
characters belong to a similar character category, there is a
possibility that character strings corresponding to the
representative character code strings converted from the keyword
include those other than the desirable keyword. However, if it is
taken into consideration that the retrieval keyword normally
consists of plural characters and moreover, plural keywords
are specified, there are actually few cases where character
strings other than the desirable keyword are obtained as a
result. In contrast, the precision of classification into similar
character categories is extremely high compared with that of
character recognition in the document image, and
consequently, the retrieval can be carried
out with little oversight. Furthermore, since the method of
ordinary full-text search can be used without change, there
is an advantage that the same processes as those of
ordinary electronic document retrieval can be executed.
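The retrieval step can be sketched as follows, under stated assumptions: Latin letters stand in for the characters of the document, the mapping `CATEGORY_OF` is an invented toy classification, and the stored category strings are searched with an ordinary substring test. None of these names come from the description.

```python
from typing import Dict, List

# Hypothetical classification: visually similar letters share one category code.
CATEGORY_OF: Dict[str, str] = {
    "c": "A", "o": "A", "e": "A",
    "l": "B", "i": "B", "t": "B",
    "a": "C", "d": "D", "g": "E",
}


def to_categories(text: str) -> str:
    """Convert each character into its similar character category code."""
    return "".join(CATEGORY_OF[ch] for ch in text)


def search(index: Dict[str, str], keyword: str) -> List[str]:
    """Return identifiers of documents whose category string contains the keyword's."""
    key = to_categories(keyword)
    return [doc_id for doc_id, categories in index.items() if key in categories]


# Registration stores only category strings, never character codes.
index = {"doc1": to_categories("coldtea"), "doc2": to_categories("dialog")}
hits_tea = search(index, "tea")
hits_lea = search(index, "lea")  # "l" and "t" share a category, so doc1 also matches
```

The second query shows the over-retrieval the paragraph mentions: a keyword that differs only within a category matches the same document, which is why longer keywords and plural keywords keep such cases rare.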

Moreover, by extracting words from the similar character
category strings by referring to the category word
dictionary, the possibility of retrieving meaningless character
strings can be reduced, and further the precision of retrieval
can be improved by taking into consideration the possibility of
connection between the words in view of part of speech.
In some cases, plural different words are represented
by the same similar character category string, but in such
cases, it may be determined which character in the category
should be adopted by more detailed identification. After the
category words are extracted, if some words corresponding
to at least a part of the extracted category words are made to
be the keywords, it is unnecessary to execute a specific
process on the retrieval keywords, and accordingly the
keyword retrieval used in the ordinary database is available.
That is, there is an advantage that the data of the electronic
document and the document image can be treated equally.
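The case where plural words share one category string can be illustrated with a short sketch. The dictionary contents, the function name `resolve` and the scoring callback are assumptions for illustration; the callback merely stands in for the more detailed identification, whose actual features are not specified here.

```python
from typing import Callable, Dict, List, Optional


def resolve(
    category_string: str,
    category_word_dict: Dict[str, List[str]],
    detail_score: Callable[[str], float],
) -> Optional[str]:
    """Pick one word for a category string; a lower score means a better detailed match."""
    candidates = category_word_dict.get(category_string, [])
    if not candidates:
        return None  # not a word in the category word dictionary
    if len(candidates) == 1:
        return candidates[0]  # unambiguous: no detailed identification needed
    return min(candidates, key=detail_score)


# Hypothetical category word dictionary: one category string, possibly several words.
category_word_dict = {"BAC": ["tea", "lea"], "ABD": ["old"]}
scores = {"tea": 0.2, "lea": 0.7}  # assumed distances from the detailed comparison
chosen = resolve("BAC", category_word_dict, lambda word: scores[word])
```

Detailed identification is invoked only for the ambiguous entries, so the extra cost is limited to the few category words that map to more than one character word.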

The foregoing description of preferred embodiments of 
this invention has been presented for purposes of illustration 
and description. It is not intended to be exhaustive or to limit 
the invention to the precise form disclosed, and modifica- 
tions and variations are possible in light of the above 
teachings or may be acquired from practice of the invention. 
The embodiments were chosen and described in order to 
explain the principles of the invention and its practical 
application to enable one skilled in the art to utilize the 
invention in various embodiments and with various modifications
as are suited to the particular use contemplated. It
is intended that the scope of the invention be defined by the 
claims appended hereto, and their equivalents. 
What is claimed is: 

1. A document processing apparatus for processing Asian 
language text comprising: 

character category storing means for storing a category of 
similar character made by classification of characters 
based on an image feature of each character with 
relation to the image feature; 
text region extracting means for segmenting an image of 
every character in said Asian language text of an 
inputted document image; 
pseudo character recognizing means for classifying the 
image of every character segmented by said text region 
extracting means into the category stored in said char- 
acter category storing means based on the image fea- 
ture related to the category; 
pseudo character recognition result storing means for 
storing the category into which the image of every 
character is classified by the pseudo character recog- 
nizing means with relation to the inputted document 
image; 

keyword converting means for converting each character 
in a retrieval expression inputted for retrieval into the 
nearest category stored in said character category stor- 
ing means; and 
document retrieving means for retrieving a document 
image having a category matching the category gener- 
ated by converting the retrieval expression by said 
keyword converting means from said pseudo character
recognition result storing means. 

2. The document processing apparatus according to claim 
1, wherein the category stored in said character category 
storing means is generated by classification of characters by
clustering feature vectors of a character image.

3. The document processing apparatus according to claim 
1, wherein the category stored with relation to the document 
image in said pseudo character recognition result storing 
means is stored as a bi-gram table storing an identifier of a
document including a key which is a category of two
adjacent character images in the document image, and said 
document retrieving means retrieves the category converted 
by said keyword converting means from the bi-gram table. 

4. The document processing apparatus according to claim 
1, wherein said character category storing means classifies 
one character into plural categories in some cases, and said 
keyword converting means converts one retrieval keyword 
into all categories stored in said character category storing 
means. 

5. The document processing apparatus according to claim 
1, wherein said character category storing means classifies 
one character into plural categories in some cases, and stores 
a probability of classification of the character into each 
category in such cases, and said document retrieving means 
retrieves a document image from said pseudo character 
recognition result storing means in accordance with the 
probability stored in said character category storing means. 

6. The document processing apparatus according to claim 
1, wherein, if there is a plurality of interpretations of 
character segmentation, said text region extracting means 
executes segmentation for all interpretations; said pseudo 
character recognizing means classifies all the results of
segmentation executed by said text region extracting means 
into categories; and said pseudo character recognition result 
storing means stores all categories classified by said pseudo 
character recognizing means with relation to the document 
image.

7. A document processing method for processing Asian
language text used in a document processing apparatus
having character category storing means for storing a category
of similar character made by classification of characters
based on an image feature of each character with
relation to the image feature, comprising the steps of:

segmenting an image of each character in said Asian 
language text of an inputted document image; 

classifying the segmented image of every character into
the category stored in said character category storing
means based on the image related to the category;

storing the category into which the image of every character
is classified with relation to the inputted document
image;

converting each character in a retrieval expression inputted
for retrieval into the nearest category stored in said
character category storing means; and

retrieving a document image having a category satisfying
the retrieval expression which has been converted into
the category.

8. A storage medium having a computer readable program 
and a dictionary for use with a computer, said dictionary 
being a character category dictionary storing a category of a 
similar character made by classification of a character based
on an image feature of each character in an Asian language 
text, said computer readable program comprising: 

program code means for causing said computer to execute 
a text region extraction process of segmenting an image
of every character in said Asian language text of an
inputted document image;

program code means for causing said computer to execute 
a pseudo character recognizing process of classifying 
the segmented image of every character segmented by 
the text region extracting process into the category 
stored in said character category dictionary based on 
the image feature related to the category; 

program code means for causing said computer to execute 
a pseudo character recognition result storing process of 
storing the category into which the image of every 
character is classified by the pseudo character recog- 
nizing process with relation to the inputted document 
image; 

program code means for causing said computer to execute 
a keyword converting process of converting each char- 
acter in a retrieval expression inputted for retrieval into 
the nearest category stored in said character category 
dictionary; and 

program code means for causing said computer to execute 
a document retrieving process of retrieving a document 
image having a category satisfying the retrieval expres- 
sion which has been converted into the category from 
the document images stored by the pseudo character 
recognition result storing process with relation to the 
category. 

* * * * *